Re: [PATCH 3/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Dexuan Cui <decui@microsoft.com>
Cc: "ak@linux.intel.com" <ak@linux.intel.com>,
	"arnd@arndb.de" <arnd@arndb.de>, "bp@alien8.de" <bp@alien8.de>,
	"brijesh.singh@amd.com" <brijesh.singh@amd.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"jane.chu@oracle.com" <jane.chu@oracle.com>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	KY Srinivasan <kys@microsoft.com>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"luto@kernel.org" <luto@kernel.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"sathyanarayanan.kuppuswamy@linux.intel.com" 
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	"seanjc@google.com" <seanjc@google.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()
Date: Thu, 24 Nov 2022 10:51:25 +0300	[thread overview]
Message-ID: <20221124075125.56cpbkmjyr26dzsn@box.shutemov.name> (raw)
In-Reply-To: <SA1PR21MB133536EA0C26DFE0168E2F98BF0C9@SA1PR21MB1335.namprd21.prod.outlook.com>

On Wed, Nov 23, 2022 at 11:51:11PM +0000, Dexuan Cui wrote:
> > From: Kirill A. Shutemov <kirill@shutemov.name>
> > Sent: Monday, November 21, 2022 4:24 PM
> > 
> > On Mon, Nov 21, 2022 at 11:51:48AM -0800, Dexuan Cui wrote:
> > > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > > allocates buffers using vzalloc(), and needs to share the buffers with the
> > > host OS by calling set_memory_decrypted(), which is not working for
> > > vmalloc() yet. Add the support by handling the pages one by one.
> > 
> > Why do you use vmalloc here in the first place?
> 
> We changed to vmalloc() long ago, mainly for 2 reasons:
> 
> 1) __alloc_pages() only allows us to allocate up to 4MB of contiguous pages, but
> we need a 16MB buffer in the Hyper-V vNIC driver for better performance.
> 
> 2) Due to memory fragmentation, we have seen that the page allocator can fail
> to allocate 2 contigous pages, though the system has a lot of free memory. We
> need to support Hyper-V vNIC hot addition, so we changed to vmalloc. See
> 
> b679ef73edc2 ("hyperv: Add support for physically discontinuous receive buffer")
> 06b47aac4924 ("Drivers: net-next: hyperv: Increase the size of the sendbuf region")
> 
> > Will you also adjust direct mapping to have shared bit set?
> > 
> > If not, we will have problems with load_unaligned_zeropad() when it will
> > access shared pages via non-shared mapping.
> > 
> > If direct mapping is adjusted, we will get direct mapping fragmentation.
> 
> load_unaligned_zeropad() was added 10 years ago by Linus in
> e419b4cc5856 ("vfs: make word-at-a-time accesses handle a non-existing page") 
> so this seemingly-strange usage is legitimate.
> 
> Sorry I don't know how to adjust direct mapping. Do you mean I should do
> something like the below in tdx_enc_status_changed_for_vmalloc() for
> every 'start_va':
>   pa = slow_virt_to_phys(start_va);
>   set_memory_decrypted(phys_to_virt(pa), 1);
> ?
> 
> But IIRC this issue is not specific to vmalloc()? e.g. we get 1 page by
> __get_free_pages(GFP_KERNEL, 0) or kmalloc(PAGE_SIZE, GFP_KERNEL)
> and we call set_memory_decrypted() for the page. How can we make
> sure the callers of load_unaligned_zeropad() can't access the page
> via non-shared mapping?

__get_free_pages() and kmalloc() returns pointer to the page in the direct
mapping. set_memory_decrypted() adjust direct mapping to have the shared
bit set. Everything is fine.

> It looks like you have a patchset to address the issue (it looks like it
> hasn't been merged into the mainline?) ?
> https://lwn.net/ml/linux-kernel/20220614120231.48165-11-kirill.shutemov@linux.intel.com/

It addresses similar, but different issue. It is only relevant for
unaccepted memory support.

> BTW, I'll drop tdx_enc_status_changed_for_vmalloc() and try to enhance the
> existing tdx_enc_status() to support both direct mapping and vmalloc().
> 
> > Maybe tap into swiotlb buffer using DMA API?
> 
> I doubt the Hyper-V vNIC driver here can call dma_alloc_coherent() to
> get a 16MB buffer from swiotlb buffers. I'm looking at dma_alloc_coherent() ->
> dma_alloc_attrs() -> dma_direct_alloc(), which typically calls 
> __dma_direct_alloc_pages() to allocate congituous memory pages (which
> can't exceed the 4MB limit. Note there is no virtual IOMMU in a guest on Hyper-V).
> 
> It looks like we can't force dma_direct_alloc() to call dma_direct_alloc_no_mapping(),
> because even if we set the DMA_ATTR_NO_KERNEL_MAPPING flag,
> force_dma_unencrypted() is still always true for a TDX guest.

The point is not in reaching dma_direct_alloc_no_mapping(). The idea is
allocate from existing swiotlb that already has shared bit set in direct
mapping.

vmap area that maps pages allocated from swiotlb also should work fine.

To be honest, I don't understand DMA API well enough. I need to experiment
with it to see what works for the case.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov