linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: Sean Christopherson <seanjc@google.com>
Cc: David Stevens <stevensd@chromium.org>,
	Christoph Hellwig <hch@infradead.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	Isaku Yamahata <isaku.yamahata@gmail.com>,
	Zhi Wang <zhi.wang.linux@gmail.com>,
	Maxim Levitsky <mlevitsk@redhat.com>,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org
Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Date: Thu, 14 Mar 2024 10:20:23 +0100	[thread overview]
Message-ID: <d2a95b5c-4c93-47b1-bb5b-ef71370be287@amd.com> (raw)
In-Reply-To: <c84fcf0a-f944-4908-b7f6-a1b66a66a6bc@amd.com>

Sending that out once more since the AMD email servers have converted it 
to HTML mail once more :(

Grrr,
Christian.

Am 14.03.24 um 10:18 schrieb Christian König:
> Am 13.03.24 um 18:26 schrieb Sean Christopherson:
>> On Wed, Mar 13, 2024, Christian König wrote:
>>> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
>>>> [SNIP]
>>>>> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
>>>>> memory the driver has allocate.
>>>>>
>>>>>> I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
>>>>>> we should instead fix the TTM allocations.  And David pointed out that that was
>>>>>> tried and got NAK'd.
>>>>> Well as far as I can see Christoph rejects the complexity coming with the
>>>>> approach of sometimes grabbing the reference and sometimes not.
>>>> Unless I've wildly misread multiple threads, that is not Christoph's objection.
>>>>   From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@infradead.org):
>>>>
>>>>     On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@infradead.org>   wrote:
>>>>     >
>>>>     > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
>>>>     > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
>>>>     > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
>>>>     > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
>>>>     > > remove that code and assume every driver in existence will do the right thing.
>>>>     >
>>>>     > Agreed.
>>>>     >
>>>>     > >
>>>>     > > With the cleanups done, playing nice with non-refcounted paged instead of outright
>>>>     > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
>>>>     > > maintenance cost.
>>>>     >
>>>>     > I tend to strongly disagree with that, though.  We can't just let these
>>>>     > non-refcounted pages spread everywhere and instead need to fix their
>>>>     > usage.
>>> And I can only repeat myself that I completely agree with Christoph here.
>> I am so confused.  If you agree with Christoph, why not fix the TTM allocations?
>
> Because the TTM allocation isn't broken in any way.
>
> See in some configurations TTM even uses the DMA API for those 
> allocations and that is actually something Christoph coded.
>
> What Christoph is really pointing out is that absolutely nobody should 
> put non-refcounted pages into a VMA, but again this isn't something 
> TTM does. What TTM does instead is to work with the PFN and puts that 
> into a VMA.
>
> It's just that then KVM comes along and converts the PFN back into a 
> struct page again and that is essentially what causes all the 
> problems, including CVE-2021-22543.
>
>>>>> And I have to agree that this is extremely odd.
>>>> Yes, it's odd and not ideal.  But with nested virtualization, KVM _must_ "map"
>>>> pfns directly into the guest via fields in the control structures that are
>>>> consumed by hardware.  I.e. pfns are exposed to the guest in an "out-of-band"
>>>> structure that is NOT part of the stage-2 page tables.  And wiring those up to
>>>> the MMU notifiers is extremely difficult for a variety of reasons[*].
>>>>
>>>> Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
>>>> to grab a reference to the struct page while the out-of-band mapping exists, i.e.
>>>> to pin the page to prevent use-after-free.
>>> Wait a second, as far as I know this approach doesn't work correctly in all
>>> cases. See here as well:https://lwn.net/Articles/930667/
>>>
>>> The MMU notifier is not only to make sure that pages don't go away and
>>> prevent use after free, but also that PTE updates correctly wait for ongoing
>>> DMA transfers.
>>>
>>> Otherwise quite a bunch of other semantics doesn't work correctly either.
>>>
>>>> And KVM's historical ABI is to support
>>>> any refcounted page for these out-of-band mappings, regardless of whether the
>>>> page was obtained by gup() or follow_pte().
>>> Well see the discussion from last year the LWN article summarizes.
>> Oof.  I suspect that in practice, no one has observed issues because the pages
>> in question are heavily used by the guest and don't get evicted from the page cache.
>
> Well in this case I strongly suggest to block the problematic cases.
>
> It just sounds like KVM never run into problems because nobody is 
> doing any of those problematic cases. If that's true it should be 
> doable to change the UAPI and say this specific combination is 
> forbidden because it could result in data corruption.
>
>>> I'm not an expert for KVM but as far as I can see what you guys do here is
>>> illegal and can lead to corruption.
>>>
>>> Basically you can't create a second channel to modify page content like
>>> that.
>> Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
>> robust.
>
> Yeah, completely agree.
>
>> That said, while this might be motivation to revisit tying the out-of-band mappings
>> to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
>> Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
>> memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
>> refcounted pages from being mapped in the out-of-band structures (see below).
>>
>>>> Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
>>>> KVM resorts to conditionally grabbing references and disllowing non-refcounted
>>>> pages from being inserted into the out-of-band mappings.
>>> Why not only refcount the pages when out of band mappings are requested?
>> That's also what this series effectively does.  By default, KVM will disallow
>> inserting *any* non-refcounted memory into the out-of-band mappings.
>
> Ok than that's basically my inconvenient. I can't really criticize the 
> KVM patch because I don't really understand all the details.
>
> I'm only pointing out from a very high level view how memory 
> management works and that I see some conflict with what KVM does.
>
> As far as I can tell the cleanest approach for KVM would be to have 
> two completely separate handlings:
>
> 1. Using GUP to handle the out-of-band mappings, this one grabs 
> references and honors all the GUP restrictions with the appropriate 
> flags. When VM_PFNMAP is set then those mappings will be rejected. 
> That should also avoid any trouble with file backed mappings.
>
> 2. Using follow_pte() plus an MMU notifier and this one can even 
> handle VMAs with the VM_PFNMAP and VM_IO flag set.
>
>> "By default" because there are use cases where the guest memory pool is hidden
>> from the kernel at boot, and is fully managed by userspace.  I.e. userspace is
>> effectively responsible/trusted to not free/change the mappings for an active
>> guest.
>
> Wow, could that potentially crash the kernel?
>
> Cheers,
> Christian.


  parent reply	other threads:[~2024-03-14  9:20 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-29  2:57 [PATCH v11 0/8] KVM: allow mapping non-refcounted pages David Stevens
2024-02-29  2:57 ` [PATCH v11 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
2024-02-29  2:57 ` [PATCH v11 2/8] KVM: Relax BUG_ON argument validation David Stevens
2024-02-29  2:57 ` [PATCH v11 3/8] KVM: mmu: Introduce kvm_follow_pfn() David Stevens
2024-02-29  2:57 ` [PATCH v11 4/8] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
2024-02-29  2:57 ` [PATCH v11 5/8] KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn() David Stevens
2024-02-29  2:57 ` [PATCH v11 6/8] KVM: x86: Migrate " David Stevens
2024-02-29  2:57 ` [PATCH v11 7/8] KVM: x86/mmu: Track if sptes refer to refcounted pages David Stevens
2024-02-29  2:57 ` [PATCH v11 8/8] KVM: x86/mmu: Handle non-refcounted pages David Stevens
2024-04-04 16:03   ` Dmitry Osipenko
2024-04-15  7:28     ` David Stevens
2024-04-15  9:36       ` Paolo Bonzini
2024-02-29 13:36 ` [PATCH v11 0/8] KVM: allow mapping " Christoph Hellwig
2024-03-13  4:55   ` David Stevens
2024-03-13  9:55     ` Christian König
2024-03-13 13:34       ` Sean Christopherson
2024-03-13 14:37         ` Christian König
2024-03-13 14:48           ` Sean Christopherson
     [not found]             ` <9e604f99-5b63-44d7-8476-00859dae1dc4@amd.com>
2024-03-13 15:09               ` Christian König
2024-03-13 15:47               ` Sean Christopherson
     [not found]                 ` <93df19f9-6dab-41fc-bbcd-b108e52ff50b@amd.com>
2024-03-13 17:26                   ` Sean Christopherson
     [not found]                     ` <c84fcf0a-f944-4908-b7f6-a1b66a66a6bc@amd.com>
2024-03-14  9:20                       ` Christian König [this message]
2024-03-14 11:31                         ` David Stevens
2024-03-14 11:51                           ` Christian König
2024-03-14 14:45                             ` Sean Christopherson
2024-03-18  1:26                             ` Christoph Hellwig
2024-03-18 13:10                               ` Paolo Bonzini
2024-03-18 23:20                                 ` Christoph Hellwig
2024-03-14 16:17                           ` Sean Christopherson
2024-03-14 17:19                             ` Sean Christopherson
2024-03-15 17:59                               ` Sean Christopherson
2024-03-20 20:54                                 ` Axel Rasmussen
2024-03-13 13:33     ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d2a95b5c-4c93-47b1-bb5b-ef71370be287@amd.com \
    --to=christian.koenig@amd.com \
    --cc=hch@infradead.org \
    --cc=isaku.yamahata@gmail.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=stevensd@chromium.org \
    --cc=yu.c.zhang@linux.intel.com \
    --cc=zhi.wang.linux@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).