linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Hansen <dave.hansen@intel.com>
To: "Huang, Kai" <kai.huang@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	"bagasdotme@gmail.com" <bagasdotme@gmail.com>,
	"ak@linux.intel.com" <ak@linux.intel.com>,
	"Wysocki, Rafael J" <rafael.j.wysocki@intel.com>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>, "Christopherson,,
	Sean" <seanjc@google.com>,
	"Chatre, Reinette" <reinette.chatre@intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Yamahata, Isaku" <isaku.yamahata@intel.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"Shahar, Sagi" <sagis@google.com>,
	"imammedo@redhat.com" <imammedo@redhat.com>,
	"Gao, Chao" <chao.gao@intel.com>,
	"Brown, Len" <len.brown@intel.com>,
	"sathyanarayanan.kuppuswamy@linux.intel.com" 
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>
Subject: Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
Date: Mon, 9 Jan 2023 08:51:20 -0800	[thread overview]
Message-ID: <e02fd75d-e9d4-15f1-eb9c-31cf3cc9ddc1@intel.com> (raw)
In-Reply-To: <bc11552572428c3b29b67852b062c387ecd7be45.camel@intel.com>

On 1/9/23 03:48, Huang, Kai wrote:
...
>>>>> This approach works as in practice all boot-time present DIMMs are TDX
>>>>> convertible memory.  However, if any non-TDX-convertible memory has been
>>>>> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
>>>>> module, the module initialization will fail.
>>>
>>> I really don't know what this is trying to say.
> 
> My intention is to explain and call out that such design (use all memory regions
> in memblock at the time of module initialization) works in practice, as long as
> non-CMR memory hasn't been added via memory hotplug.
> 
> Not sure if it is necessary, but I was thinking it may help reviewer to judge
> whether such design is acceptable.

This is yet another case where you've mechanically described the "what",
but left out the implications or the underlying basis "why".

I'd take a more methodical approach to describe what is going on here.
List the steps that must occur, or at least *one* example of those steps
and how they intereact with the code in this patch.  Then, explain the
fallout.

I also don't think it's quite right to call out "CXL memory via kmem
driver".  If the CXL memory was "System RAM", it should get covered by a
CMR and TDMR.  The kmem driver can still go wild with it.

>>> *How* and *why* does this module initialization failure occur?
> 
> If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
> will fail.

I'm frankly lost now.  Please go back and try to explain this better.
Let me know if you want to iterate on this faster than resending this
series five more times.  I've got some ideas.

>>>>> This can also be enhanced in the future, i.e. by allowing adding non-TDX
>>>>> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>>>>> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>>>>> needs to guarantee memory pages for TDX guests are always allocated from
>>>>> the "TDX-capable" nodes.
>>>
>>> Why does it need to be enhanced?  What's the problem?
> 
> The problem is after TDX module initialization, no more memory can be hot-added
> to the page allocator.
> 
> Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> utilize all memory.
> 
> But probably it is not necessarily to call out in the changelog?

Let's say that we add this TDX-compatible-node ABI in the future.  What
will old code do that doesn't know about this ABI?

...
>>>>> +       list_for_each_entry(tmb, &tdx_memlist, list) {
>>>>> +               /*
>>>>> +                * The new range is TDX memory if it is fully covered by
>>>>> +                * any TDX memory block.
>>>>> +                *
>>>>> +                * Note TDX memory blocks are originated from memblock
>>>>> +                * memory regions, which can only be contiguous when two
>>>>> +                * regions have different NUMA nodes or flags.  Therefore
>>>>> +                * the new range cannot cross multiple TDX memory blocks.
>>>>> +                */
>>>>> +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
>>>>> +                       return true;
>>>>> +       }
>>>>> +       return false;
>>>>> +}
>>>
>>> I don't really like that comment.  It should first state its behavior
>>> and assumptions, like:
>>>
>>>     This check assumes that the start_pfn<->end_pfn range does not
>>>     cross multiple tdx_memlist entries.
>>>
>>> Only then should it describe why that is OK:
>>>
>>>     A single memory hotplug even across mutliple memblocks (from
>>>     which tdx_memlist entries are derived) is impossible.  ... then
>>>     actually explain
>>>
> 
> How about below?
> 
>         /*
>          * This check assumes that the start_pfn<->end_pfn range does not cross
>          * multiple tdx_memlist entries. A single memory hotplug event across
>          * multiple memblocks (from which tdx_memlist entries are derived) is
>          * impossible. That means start_pfn<->end_pfn range cannot exceed a
>          * tdx_memlist entry, and the new range is TDX memory if it is fully
>          * covered by any tdx_memlist entry.
>          */

I was hoping you would actually explain why it is impossible.

Is there something fundamental that keeps a memory area that spans two
nodes from being removed and then a new area added that is comprised of
a single node?

Boot time:

	| memblock  |  memblock |
	<--Node=0--> <--Node=1-->

Funky hotplug... nothing to see here, then:

	<--------Node=2-------->

I would believe that there is no current bare-metal TDX system that has
an implementation like this.  But, the comments above speak like it's
fundamentally impossible.  That should be clarified.

In other words, that comment talks about memblock attributes as being
the core underlying reason that that simplified check is OK.  Is that
it, or is it really the reduced hotplug feature set on TDX systems?


...
>>>>> +        * Build the list of "TDX-usable" memory regions which cover all
>>>>> +        * pages in the page allocator to guarantee that.  Do it while
>>>>> +        * holding mem_hotplug_lock read-lock as the memory hotplug code
>>>>> +        * path reads the @tdx_memlist to reject any new memory.
>>>>> +        */
>>>>> +       get_online_mems();
>>>
>>> Oh, it actually uses the memory hotplug locking for list protection.
>>> That's at least a bit subtle.  Please document that somewhere in the
>>> functions that actually manipulate the list.
> 
> add_tdx_memblock() and free_tdx_memlist() eventually calls list_add_tail() and
> list_del() to manipulate the list, but they actually takes 'struct list_head
> *tmb_list' as argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.
> 
> Do you mean document the locking around the implementation of add_tdx_memblock()
> and free_tdx_memlist()?
> 
> Or should we just mention it around the 'tdx_memlist' variable?
> 
> /* All TDX-usable memory regions. Protected by memory hotplug locking. */
> static LIST_HEAD(tdx_memlist);

I don't think I'd hate it being in all three spots.  Also "protected by
memory hotplug locking" is pretty generic.  Please be more specific.

>>> I think it's also worth saying something here about the high-level
>>> effects of what's going on:
>>>
>>>     Take a snapshot of the memory configuration (memblocks).  This
>>>     snapshot will be used to enable TDX support for *this* memory
>>>     configuration only.  Use a memory hotplug notifier to ensure
>>>     that no other RAM can be added outside of this configuration.
>>>
>>> That's it, right?
> 
> Yes. I'll somehow include above into the comment around get_online_mems().
> 
> But should I move "Use a memory hotplug notifier ..." part to:
> 
>         err = register_memory_notifier(&tdx_memory_nb);
> 
> because this is where we actually use the memory hotplug notifier?

I actually want that snippet in the changelog.

>>>>> +       ret = build_tdx_memlist(&tdx_memlist);
>>>>> +       if (ret)
>>>>> +               goto out;
>>>>> +
>>>>>         /*
>>>>>          * TODO:
>>>>>          *
>>>>> -        *  - Build the list of TDX-usable memory regions.
>>>>>          *  - Construct a list of TDMRs to cover all TDX-usable memory
>>>>>          *    regions.
>>>>>          *  - Pick up one TDX private KeyID as the global KeyID.
>>>>> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>>>>>          */
>>>>>         ret = -EINVAL;
>>>>>  out:
>>>>> +       /*
>>>>> +        * @tdx_memlist is written here and read at memory hotplug time.
>>>>> +        * Lock out memory hotplug code while building it.
>>>>> +        */
>>>>> +       put_online_mems();
>>>>>         return ret;
>>>>>  }
>>>
>>> You would also be wise to have the folks who do a lot of memory hotplug
>>> work today look at this sooner rather than later.  I _think_ what you
>>> have here is OK, but I'm really rusty on the code itself.
>>>
> 
> Thanks for advice. Will do.
> 


  reply	other threads:[~2023-01-09 16:52 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
2022-12-09  6:52 ` [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
2023-01-06 19:04   ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2023-01-06 17:09   ` Dave Hansen
2023-01-08 22:25     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
2023-01-06 19:04   ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
2023-01-06 17:14   ` Dave Hansen
2023-01-08 22:26     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL Kai Huang
2023-01-06 17:29   ` Dave Hansen
2023-01-09 10:30     ` Huang, Kai
2023-01-09 19:54       ` Dave Hansen
2023-01-09 22:10         ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2023-01-06 17:46   ` Dave Hansen
2023-01-09 10:25     ` Huang, Kai
2023-01-09 19:52       ` Dave Hansen
2023-01-09 22:07         ` Huang, Kai
2023-01-09 22:11           ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
2023-01-06 18:18   ` Dave Hansen
2023-01-09 11:48     ` Huang, Kai
2023-01-09 16:51       ` Dave Hansen [this message]
2023-01-10 12:09         ` Huang, Kai
2023-01-10 16:18           ` Dave Hansen
2023-01-11 10:00             ` Huang, Kai
2023-01-12  0:56               ` Huang, Ying
2023-01-12  1:18                 ` Huang, Kai
2023-01-12  1:59                   ` Huang, Ying
2023-01-12  2:22                     ` Huang, Kai
2023-01-12 11:33         ` Huang, Kai
2023-01-18 11:08   ` Huang, Kai
2023-01-18 13:57     ` David Hildenbrand
2023-01-18 19:38       ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
2023-01-06 19:24   ` Dave Hansen
2023-01-10  0:40     ` Huang, Kai
2023-01-10  0:47       ` Dave Hansen
2023-01-10  2:23         ` Huang, Kai
2023-01-10 19:12           ` Dave Hansen
2023-01-11  9:23             ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 09/16] x86/virt/tdx: Fill out " Kai Huang
2023-01-06 19:36   ` Dave Hansen
2023-01-10  0:45     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2023-01-06 21:53   ` Dave Hansen
2023-01-10  0:49     ` Huang, Kai
2023-01-07  0:47   ` Dave Hansen
2023-01-10  0:47     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
2023-01-06 22:07   ` Dave Hansen
2023-01-10  1:19     ` Huang, Kai
2023-01-10  1:22       ` Dave Hansen
2023-01-10 11:01         ` Huang, Kai
2023-01-10 15:19           ` Dave Hansen
2023-01-11 10:57             ` Huang, Kai
2023-01-11 16:16               ` Dave Hansen
2023-01-11 22:10                 ` Huang, Kai
2023-01-10 11:01       ` Huang, Kai
2023-01-10 15:17         ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module Kai Huang
2023-01-06 22:21   ` Dave Hansen
2023-01-10 10:48     ` Huang, Kai
2023-01-10 16:25       ` Dave Hansen
2023-01-10 23:33         ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2023-01-06 22:49   ` Dave Hansen
2023-01-10 10:15     ` Huang, Kai
2023-01-10 16:53       ` Dave Hansen
2023-01-11  0:06         ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
2023-01-07  0:17   ` Dave Hansen
2023-01-10 10:23     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
2023-01-07  0:35   ` Dave Hansen
2023-01-10 11:29     ` Huang, Kai
2023-01-10 15:27       ` Dave Hansen
2023-01-11  0:13         ` Huang, Kai
2023-01-11  0:30           ` Dave Hansen
2023-01-11  1:58             ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e02fd75d-e9d4-15f1-eb9c-31cf3cc9ddc1@intel.com \
    --to=dave.hansen@intel.com \
    --cc=ak@linux.intel.com \
    --cc=bagasdotme@gmail.com \
    --cc=chao.gao@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=imammedo@redhat.com \
    --cc=isaku.yamahata@intel.com \
    --cc=kai.huang@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rafael.j.wysocki@intel.com \
    --cc=reinette.chatre@intel.com \
    --cc=sagis@google.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).