Re: [PATCH] vfio/type1: Restore mapping performance with mdev support

From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: <cjia@nvidia.com>, <linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>
Subject: Re: [PATCH] vfio/type1: Restore mapping performance with mdev support
Date: Thu, 15 Dec 2016 23:27:54 +0530
Message-ID: <02707161-145f-25f3-ab47-c63d1de81e02@nvidia.com>
In-Reply-To: <20161215010347.3942360a@t450s.home>

On 12/15/2016 1:33 PM, Alex Williamson wrote:
> On Thu, 15 Dec 2016 12:05:35 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 12/14/2016 2:28 AM, Alex Williamson wrote:
>>> As part of the mdev support, type1 now gets a task reference per
>>> vfio_dma and uses that to get an mm reference for the task while
>>> working on accounting.  That's the correct thing to do for paths
>>> where we can't rely on using current, but there are still hot paths
>>> where we can optimize because we know we're invoked by the user.
>>>
>>> Specifically, vfio_pin_pages_remote() is only called when the user
>>> does DMA mapping (vfio_dma_do_map) or if an IOMMU group is added to
>>> a container with existing mappings (vfio_iommu_replay).  We can
>>> therefore use current->mm as well as rlimit() and capable() directly
>>> rather than going through the high overhead path via the stored
>>> task_struct.  We also know that vfio_dma_do_unmap() is only called
>>> via user ioctl, so we can also tune that path to be more lightweight.
>>>
>>> In a synthetic guest mapping test emulating a 1TB VM backed by a
>>> single 4GB range remapped multiple times across the address space,
>>> the mdev changes to the type1 backend introduced a roughly 25% hit
>>> in runtime of this test.  These changes restore it to nearly the
>>> previous performance for the interfaces exercised here,
>>> VFIO_IOMMU_MAP_DMA and release on close.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c |  145 +++++++++++++++++++++------------------
>>>  1 file changed, 79 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 9815e45..8dfeafb 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -103,6 +103,10 @@ struct vfio_pfn {
>>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>>>  					(!list_empty(&iommu->domain_list))
>>>  
>>> +/* Make function bool options readable */
>>> +#define IS_CURRENT	(true)
>>> +#define DO_ACCOUNTING	(true)
>>> +
>>>  static int put_pfn(unsigned long pfn, int prot);
>>>  
>>>  /*
>>> @@ -264,7 +268,8 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>>>  	kfree(vwork);
>>>  }
>>>  
>>> -static void vfio_lock_acct(struct task_struct *task, long npage)
>>> +static void vfio_lock_acct(struct task_struct *task,
>>> +			   long npage, bool is_current)
>>>  {
>>>  	struct vwork *vwork;
>>>  	struct mm_struct *mm;
>>> @@ -272,24 +277,31 @@ static void vfio_lock_acct(struct task_struct *task, long npage)
>>>  	if (!npage)
>>>  		return;
>>>  
>>> -	mm = get_task_mm(task);
>>> +	mm = is_current ? task->mm : get_task_mm(task);
>>>  	if (!mm)
>>> -		return; /* process exited or nothing to do */
>>> +		return; /* process exited */
>>>  
>>>  	if (down_write_trylock(&mm->mmap_sem)) {
>>>  		mm->locked_vm += npage;
>>>  		up_write(&mm->mmap_sem);
>>> -		mmput(mm);
>>> +		if (!is_current)
>>> +			mmput(mm);
>>>  		return;
>>>  	}
>>>  
>>> +	if (is_current) {
>>> +		mm = get_task_mm(task);
>>> +		if (!mm)
>>> +			return;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Couldn't get mmap_sem lock, so must setup to update
>>>  	 * mm->locked_vm later. If locked_vm were atomic, we
>>>  	 * wouldn't need this silliness
>>>  	 */
>>>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>>> -	if (!vwork) {
>>> +	if (WARN_ON(!vwork)) {
>>>  		mmput(mm);
>>>  		return;
>>>  	}
>>> @@ -345,13 +357,13 @@ static int put_pfn(unsigned long pfn, int prot)
>>>  }
>>>  
>>>  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>> -			 int prot, unsigned long *pfn)
>>> +			 int prot, unsigned long *pfn, bool is_current)
>>>  {
>>>  	struct page *page[1];
>>>  	struct vm_area_struct *vma;
>>>  	int ret;
>>>  
>>> -	if (mm == current->mm) {
>>> +	if (is_current) {  
>>
>> With this change, if vfio_pin_page_external() gets called from QEMU
>> process context, for example in response to some BAR0 register access,
>> it will still fall back to the slow path, get_user_pages_remote(). We
>> don't have to change this function; it already picks the best
>> possible path.
>>
>> That also makes me think: vfio_pin_page_external() uses the task
>> structure to get the mlock limit and capability. The expectation is
>> that an mdev vendor driver shouldn't pin all system memory, but if any
>> mdev driver does that, then that driver might see such a performance
>> impact. Should we optimize this path if (dma->task == current)?
> 
> Hi Kirti,
> 
> I was actually trying to avoid the (task == current) test with this
> change because I wasn't sure how reliable it is.  Is there a
> possibility that this test generates a false positive if current
> coincidentally matches our task and does that allow us the same
> opportunities for making use of current that we have when we know we're
> in a process context execution path?  The above change makes this a more
> direct association.  Can you show that inferring the process context is
> correct?  Thanks,

We do hold a reference on the task structure, get_task_struct(current),
before saving it in dma->task; that reference is released with
put_task_struct() from vfio_remove_dma(). That makes sure we have a
valid reference to the task structure until we remove/free that dma
structure. Why would the check (dma->task == current) be a false
positive? A vendor driver can call vfio_pin_pages() on access to some
emulated register from the same task that mapped the dma range; in that
case this check would be true.
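
To illustrate the lifetime argument, here is a minimal user-space sketch,
not kernel code. struct task, struct dma, task_get(), task_put(),
dma_map(), dma_remove() and dma_task_is_current() are hypothetical
stand-ins for task_struct, vfio_dma, get_task_struct(),
put_task_struct(), the map path, vfio_remove_dma() and the proposed
(dma->task == current) check. It only models the reference being held
from map to remove, which is why the stored pointer should stay valid
and keep identifying the mapping task; it does not show whether
current->mm can be used in every context where the check is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's real definitions. */
struct task {
	int usage;		/* models the task_struct usage count */
};

struct dma {
	struct task *task;	/* models dma->task */
};

/* Models get_task_struct(): take a reference so the task stays valid. */
static struct task *task_get(struct task *t)
{
	t->usage++;
	return t;
}

/* Models put_task_struct(): drop a reference taken by task_get(). */
static void task_put(struct task *t)
{
	assert(t->usage > 0);
	t->usage--;
}

/* Map time: the caller is the mapping task, so hold a reference to it. */
static void dma_map(struct dma *dma, struct task *current_task)
{
	dma->task = task_get(current_task);
}

/* Models vfio_remove_dma(): the reference is dropped only here. */
static void dma_remove(struct dma *dma)
{
	task_put(dma->task);
	dma->task = NULL;
}

/* The proposed test: true only when the caller is the task that mapped. */
static bool dma_task_is_current(const struct dma *dma,
				const struct task *current_task)
{
	return dma->task == current_task;
}
```

In this model, a caller that is the mapping task takes the fast path and
any other caller does not, for as long as the dma structure exists.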

Thanks,
Kirti