RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory

From: Michael Kelley <mikelley@microsoft.com>
To: Nuno Das Neves <nunodasneves@linux.microsoft.com>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "virtualization@lists.linux-foundation.org" 
	<virtualization@lists.linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"viremana@linux.microsoft.com" <viremana@linux.microsoft.com>,
	Sunil Muthuswamy <sunilmut@microsoft.com>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>,
	Lillian Grassin-Drake <Lillian.GrassinDrake@microsoft.com>,
	KY Srinivasan <kys@microsoft.com>
Subject: RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
Date: Mon, 8 Mar 2021 19:30:00 +0000	[thread overview]
Message-ID: <MWHPR21MB15934FDC8DBE4088E8227AAFD7939@MWHPR21MB1593.namprd21.prod.outlook.com> (raw)
In-Reply-To: <d63330fa-de83-85de-c8ec-74cc90d680e3@linux.microsoft.com>

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, March 8, 2021 11:14 AM
> 
> On 2/8/2021 11:45 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November
> 20, 2020 4:30 PM
> >>

[snip]

> >> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
> >>  	return -hv_status_to_errno(status);
> >>  }
> >>
> >> +static int
> >> +hv_call_map_gpa_pages(u64 partition_id,
> >> +		      u64 gpa_target,
> >> +		      u64 page_count, u32 flags,
> >> +		      struct page **pages)
> >> +{
> >> +	struct hv_map_gpa_pages *input_page;
> >> +	int status;
> >> +	int i;
> >> +	struct page **p;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +	int ret = 0;
> >> +
> >> +	while (remaining) {
> >> +
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +
> >> +		local_irq_save(irq_flags);
> >> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_input_arg));
> >> +
> >> +		input_page->target_partition_id = partition_id;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +		input_page->map_flags = flags;
> >> +
> >> +		for (i = 0, p = pages; i < rep_count; i++, p++)
> >> +			input_page->source_gpa_page_list[i] =
> >> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;
> >
> > The masking seems a bit weird.  The mask allows for up to 64G page frames,
> > which is 256 Tbytes of total physical memory, which is probably the current
> > Hyper-V limit on memory size (48 bit physical address space, though 52 bit
> > physical address spaces are coming).  So the masking shouldn't ever be doing
> > anything.   And if it was doing something, that probably should be treated as
> > an error rather than simply dropping the high bits.
> 
> Good point - It looks like the mask isn't needed.
> 
> >
> > Note that this code does not handle the case where PAGE_SIZE !=
> > HV_HYP_PAGE_SIZE.  But maybe we'll never run the root partition with a
> > page size other than 4K.
> >
> 
> For now on x86 it won't happen, but maybe on ARM?
> It shouldn't be hard to support this case, especially since
> PAGE_SIZE >= HV_HYP_PAGE_SIZE. Do you think we need it in this patch set?

No, from my perspective, this case does not need to be handled in 
this patch set.

> 
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >> +		local_irq_restore(irq_flags);
> >> +
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +
> >> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +						    partition_id, 256);
> >
> > Why adding 256 pages?  I'm just contrasting with other places that add
> > 1 page at a time.  Maybe a comment to explain ....
> >
> 
> Empirically determined. I'll add a #define and comment.
> 
> >> +			if (ret)
> >> +				break;
> >> +		} else if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		pages += completed;
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +	}
> >> +
> >> +	if (ret && completed) {
> >
> > Is the above the right test?  Completed could be zero from the most
> > recent iteration, but still could be partially succeeded based on a previous
> > successful iteration.   I think this needs to check whether remaining equals
> > page_count.
> >
> 
> You're right; I'll change it to (ret && remaining < page_count)
> 
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +hv_call_unmap_gpa_pages(u64 partition_id,
> >> +			u64 gpa_target,
> >> +			u64 page_count, u32 flags)
> >> +{
> >> +	struct hv_unmap_gpa_pages *input_page;
> >> +	int status;
> >> +	int ret = 0;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +
> >> +	local_irq_save(irq_flags);
> >> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> >> +		hyperv_pcpu_input_arg));
> >> +
> >> +	input_page->target_partition_id = partition_id;
> >> +	input_page->target_gpa_base = gpa_target;
> >> +	input_page->unmap_flags = flags;
> >> +
> >> +	while (remaining) {
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >
> > Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.
> >
> 
> As above - do we need this for this patch set? This won't happen on x86.

Again, not needed from my perspective.

> 
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +		if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +	}
> >> +	local_irq_restore(irq_flags);
> >
> > I have some concern about holding interrupts disabled for this long.
> >
> 
> How about I move the interrupt enabling/disabling inside the loop? i.e.:
>         while (remaining) {
>                 local_irq_save(irq_flags);
>                 input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
>                         hyperv_pcpu_input_arg));
> 
>                 input_page->target_partition_id = partition_id;
>                 input_page->target_gpa_base = gpa_target;
>                 input_page->unmap_flags = flags;
>                 rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
>                 status = hv_do_rep_hypercall(
>                         HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
>                 local_irq_restore(irq_flags);
> 
>                 completed = (status & HV_HYPERCALL_REP_COMP_MASK) >>
>                                 HV_HYPERCALL_REP_COMP_OFFSET;
>                 status &= HV_HYPERCALL_RESULT_MASK;
>                 if (status != HV_STATUS_SUCCESS) {
>                         pr_err("%s: completed %llu out of %llu, %s\n",
>                                __func__,
>                                page_count - remaining, page_count,
>                                hv_status_to_string(status));
>                         ret = hv_status_to_errno(status);
>                         break;
>                 }
> 
>                 remaining -= completed;
>                 gpa_target += completed;
>         }
> 
> 

Yes, that would help.

> >> +
> >> +	if (ret && completed) {
> >
> > Same comment as before.
> >
> 
> Ditto as above.
> 
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long
> >> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
> >> +				struct mshv_user_mem_region __user *user_mem)
> >> +{
> >> +	struct mshv_user_mem_region mem;
> >> +	struct mshv_mem_region *region;
> >> +	int completed;
> >> +	unsigned long remaining, batch_size;
> >> +	int i;
> >> +	struct page **pages;
> >> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
> >> +	u64 region_page_count, region_user_start, region_user_end;
> >> +	u64 region_gpfn_start, region_gpfn_end;
> >> +	long ret = 0;
> >> +
> >> +	/* Check we have enough slots*/
> >> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
> >> +		pr_err("%s: not enough memory region slots\n", __func__);
> >> +		return -ENOSPC;
> >> +	}
> >> +
> >> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> >> +		return -EFAULT;
> >> +
> >> +	if (!mem.size ||
> >> +	    mem.size & (PAGE_SIZE - 1) ||
> >> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||
> >
> > There's a PAGE_ALIGNED macro that expresses exactly what
> > each of the previous two tests is doing.
> >
> 
> Since these need to be HV_HYP_PAGE_SIZE aligned, I will add a
> HV_HYP_PAGE_ALIGNED macro for this.

I was thinking that PAGE_SIZE and PAGE_ALIGNED are correct.   If
this code were running on an ARM64 system with a 64K page
size, the 64K alignment would be fine and will make sense from
the user space perspective.   You don't want to be mapping part
of a user space page.  And 64K alignment will certainly satisfy
Hyper-V's requirement for 4K alignment.  The real requirement
from Hyper-V's standpoint is that the alignment not be smaller
than 4K.  But maybe I'm misunderstanding.

Michael