From: Haiyang Zhang <haiyangz@microsoft.com>
To: Michael Kelley <mikelley@microsoft.com>,
	Tianyu Lan <ltykernel@gmail.com>,
	KY Srinivasan <kys@microsoft.com>,
	Stephen Hemminger <sthemmin@microsoft.com>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>,
	Dexuan Cui <decui@microsoft.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"bp@alien8.de" <bp@alien8.de>, "x86@kernel.org" <x86@kernel.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"luto@kernel.org" <luto@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"boris.ostrovsky@oracle.com" <boris.ostrovsky@oracle.com>,
	"jgross@suse.com" <jgross@suse.com>,
	"sstabellini@kernel.org" <sstabellini@kernel.org>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"will@kernel.org" <will@kernel.org>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"kuba@kernel.org" <kuba@kernel.org>,
	"jejb@linux.ibm.com" <jejb@linux.ibm.com>,
	"martin.petersen@oracle.com" <martin.petersen@oracle.com>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"arnd@arndb.de" <arnd@arndb.de>, "hch@lst.de" <hch@lst.de>,
	"m.szyprowski@samsung.com" <m.szyprowski@samsung.com>,
	"robin.murphy@arm.com" <robin.murphy@arm.com>,
	"brijesh.singh@amd.com" <brijesh.singh@amd.com>,
	Tianyu Lan <Tianyu.Lan@microsoft.com>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	"pgonda@google.com" <pgonda@google.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	"rppt@kernel.org" <rppt@kernel.org>,
	"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
	"aneesh.kumar@linux.ibm.com" <aneesh.kumar@linux.ibm.com>,
	"saravanand@fb.com" <saravanand@fb.com>,
	"krish.sadhukhan@oracle.com" <krish.sadhukhan@oracle.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"tj@kernel.org" <tj@kernel.org>,
	"rientjes@google.com" <rientjes@google.com>
Cc: "iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	vkuznets <vkuznets@redhat.com>,
	"parri.andrea@gmail.com" <parri.andrea@gmail.com>,
	"dave.hansen@intel.com" <dave.hansen@intel.com>
Subject: RE: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for netvsc driver
Date: Wed, 15 Sep 2021 16:46:57 +0000	[thread overview]
Message-ID: <MN2PR21MB12959F10240EC1BB2270B345CADB9@MN2PR21MB1295.namprd21.prod.outlook.com> (raw)
In-Reply-To: <MWHPR21MB15939A5D74CA1DF25EE816ADD7DB9@MWHPR21MB1593.namprd21.prod.outlook.com>



> -----Original Message-----
> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Wednesday, September 15, 2021 12:22 PM
> To: Tianyu Lan <ltykernel@gmail.com>; KY Srinivasan <kys@microsoft.com>;

> > +				memset(vmap_pages, 0,
> > +				       sizeof(*vmap_pages) * vmap_page_index);
> > +				vmap_page_index = 0;
> > +
> > +				for (j = 0; j < i; j++)
> > +					__free_pages(pages[j], alloc_unit);
> > +
> > +				kfree(pages);
> > +				alloc_unit = 1;
> 
> This is the case where a large enough contiguous physical memory chunk
> could not be found.  But rather than dropping all the way down to single
> pages, would it make sense to try something smaller, but not 1?  For
> example, cut the alloc_unit in half and try again.  But I'm not sure of
> all the implications.

I had the same question. But gradually decreasing the allocation size would
probably take too much time?

> 
> > +				goto retry;
> > +			}
> > +		}
> > +
> > +		pages[i] = page;
> > +		for (j = 0; j < alloc_unit; j++)
> > +			vmap_pages[vmap_page_index++] = page++;
> > +	}
> > +
> > +	vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
> > +	kfree(vmap_pages);
> > +
> > +	*pages_array = pages;
> > +	return vaddr;
> > +
> > +cleanup:
> > +	for (j = 0; j < i; j++)
> > +		__free_pages(pages[j], alloc_unit);
> > +
> > +	kfree(pages);
> > +	kfree(vmap_pages);
> > +	return NULL;
> > +}
> > +
> > +static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
> > +{
> > +	int pg_count = count * alloc_unit;
> > +	struct page *page;
> > +	unsigned long *pfns;
> > +	int pfn_index = 0;
> > +	void *vaddr;
> > +	int i, j;
> > +
> > +	if (!pages)
> > +		return NULL;
> > +
> > +	pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return NULL;
> > +
> > +	for (i = 0; i < count; i++) {
> > +		page = pages[i];
> > +		if (!page) {
> > +			pr_warn("page is not available %d.\n", i);
> > +			kfree(pfns);
> > +			return NULL;
> > +		}
> > +
> > +		for (j = 0; j < alloc_unit; j++) {
> > +			pfns[pfn_index++] = page_to_pfn(page++) +
> > +				(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> > +		}
> > +	}
> > +
> > +	vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
> > +	kfree(pfns);
> > +	return vaddr;
> > +}
> > +
> 
> I think you are proposing this approach to allocating memory for the
> send and receive buffers so that you can avoid having two virtual
> mappings for the memory, per comments from Christoph Hellwig.  But
> overall, the approach seems a bit complex and I wonder if it is worth it.
> If allocating large contiguous chunks of physical memory is successful,
> then there is some memory savings in that the data structures needed to
> keep track of the physical pages is smaller than the equivalent page
> tables might be.  But if you have to revert to allocating individual
> pages, then the memory savings is reduced.
> 
> Ultimately, the list of actual PFNs has to be kept somewhere.  Another
> approach would be to do the reverse of what hv_map_memory() from the v4
> patch series does.  I.e., you could do virt_to_phys() on each virtual
> address that maps above VTOM, and subtract out the shared_gpa_boundary
> to get the list of actual PFNs that need to be freed.  This way you
> don't have two copies of the list of PFNs -- one with and one without
> the shared_gpa_boundary added.  But it comes at the cost of additional
> code so that may not be a great idea.
> 
> I think what you have here works, and I don't have a clearly better
> solution at the moment except perhaps to revert to the v4 solution and
> just have two virtual mappings.  I'll keep thinking about it.  Maybe
> Christoph has other thoughts.
> 
> >  static int netvsc_init_buf(struct hv_device *device,
> >  			   struct netvsc_device *net_device,
> >  			   const struct netvsc_device_info *device_info)
> > @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
> >  	struct nvsp_1_message_send_receive_buffer_complete *resp;
> >  	struct net_device *ndev = hv_get_drvdata(device);
> >  	struct nvsp_message *init_packet;
> > -	unsigned int buf_size;
> > +	unsigned int buf_size, alloc_unit;
> >  	size_t map_words;
> >  	int i, ret = 0;
> >
> > @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device *device,
> >  		buf_size = min_t(unsigned int, buf_size,
> >  				 NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
> >
> > -	net_device->recv_buf = vzalloc(buf_size);
> > +	if (hv_isolation_type_snp())
> > +		net_device->recv_buf =
> > +			netvsc_alloc_pages(&net_device->recv_pages,
> > +					   &net_device->recv_page_count,
> > +					   buf_size);
> > +	else
> > +		net_device->recv_buf = vzalloc(buf_size);
> > +
> 
> I wonder if it is necessary to have two different code paths here.  The
> allocating and freeing of the send and receive buffers is not perf
> sensitive, and it seems like netvsc_alloc_pages() could be used
> regardless of whether SNP Isolation is in effect.  To my thinking, one
> code path is better than two code paths unless there's a compelling
> reason to have two.

I still prefer keeping the simple vzalloc for non-isolated VMs, because a
simpler code path usually means more robustness.
I don't know how big the time difference between the two is, but in some
cases we really care about boot time.
Also, in the multi-vPort case for MANA, we potentially support hundreds of
vPorts, and there will be the same number of synthetic NICs associated with
them. So even a small difference in initialization time may add up.

Thanks,
- Haiyang

