Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

From: Jason Wang <jasowang@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
Date: Wed, 26 Dec 2018 11:57:32 +0800	[thread overview]
Message-ID: <70978ed8-bf76-693a-0e11-d31b6234af5c@redhat.com> (raw)
In-Reply-To: <20181225071501-mutt-send-email-mst@kernel.org>

On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>
>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>> access the metadata directly.
>>>>>>>>
>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>
>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>
>>>>>>>> Before: ~5.0Mpps
>>>>>>>> After:  ~6.1Mpps
>>>>>>>>
>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      drivers/vhost/vhost.h |  11 +++
>>>>>>>>      2 files changed, 189 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>      		vq->indirect = NULL;
>>>>>>>>      		vq->heads = NULL;
>>>>>>>>      		vq->dev = dev;
>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>      		mutex_init(&vq->mutex);
>>>>>>>>      		vhost_vq_reset(dev, vq);
>>>>>>>>      		if (vq->handle_kick)
>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>      	spin_unlock(&dev->iotlb_lock);
>>>>>>>>      }
>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>> +			   size_t size, int write)
>>>>>>>> +{
>>>>>>>> +	struct page **pages;
>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>> +	int npinned;
>>>>>>>> +	void *vaddr;
>>>>>>>> +
>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>> +	if (!pages)
>>>>>>>> +		return -ENOMEM;
>>>>>>>> +
>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>> +	if (npinned != npages)
>>>>>>>> +		goto err;
>>>>>>>> +
>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>> implementation in particular isn't a good idea
>>>>>>> as it keeps the page around forever.
>>>> The pages wil be released during set features.
>>>>
>>>>
>>>>>>> So no THP, no NUMA rebalancing,
>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>> consider the gain we have?
>>> We as in vhost? networking isn't the only thing guest does.
>>> We don't even know if this guest does a lot of networking.
>>> You don't
>>> know what else is in this huge page. Can be something very important
>>> that guest touches all the time.
>>
>> Well, the probability should be very small consider we usually give several
>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>> differences.
> Thanks!
>
>>>> For NUMA rebalancing, I'm even not quite sure if
>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>> in two nodes.
>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>> other functionality of the guest.
>>>
>>> That's what bothers me with this approach.
>>
>> So:
>>
>> - The rest of the pages could still be balanced to other nodes, no?
>>
>> - try to balance metadata pages (belongs to co-operate processes) itself is
>> still questionable
> I am not sure why. It should be easy enough to force the VCPU and vhost
> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> Clearly sometimes this would be necessary for load balancing reasons.

Yes, but it looks to me the part of motivation of auto NUMA is to avoid 
manual pinning.

> With autonuma after a while (could take seconds but it will happen) the
> memory will migrate.
>

Yes. As you mentioned during the discuss, I wonder we could do it 
similarly through mmu notifier like APIC access page in commit 
c24ae0dcd3e ("kvm: x86: Unpin and remove kvm_arch->apic_access_page")

>
>
>>>
>>>
>>>
>>>>>> This is the price of all GUP users not only vhost itself.
>>>>> Yes. GUP is just not a great interface for vhost to use.
>>>> Zerocopy codes (enabled by defualt) use them for years.
>>> But only for TX and temporarily. We pin, read, unpin.
>>
>> Probably not. For several reasons that the page will be not be released soon
>> or held for a very long period of time or even forever.
>
> With zero copy? Well it's pinned until transmit. Takes a while
> but could be enough for autocopy to work esp since
> its the packet memory so not reused immediately.
>
>>> Your patch is different
>>>
>>> - it writes into memory and GUP has known issues with file
>>>     backed memory
>>
>> The ordinary user for vhost is anonymous pages I think?
>
> It's not the most common scenario and not the fastest one
> (e.g. THP does not work) but file backed is useful sometimes.
> It would not be nice at all to corrupt guest memory in that case.

Ok.

>
>>> - it keeps pages pinned forever
>>>
>>>
>>>
>>>>>> What's more
>>>>>> important, the goal is not to be left too much behind for other backends
>>>>>> like DPDK or AF_XDP (all of which are using GUP).
>>>>> So these guys assume userspace knows what it's doing.
>>>>> We can't assume that.
>>>> What kind of assumption do you they have?
>>>>
>>>>
>>>>>>> userspace-controlled
>>>>>>> amount of memory locked up and not accounted for.
>>>>>> It's pretty easy to add this since the slow path was still kept. If we
>>>>>> exceeds the limitation, we can switch back to slow path.
>>>>>>
>>>>>>> Don't get me wrong it's a great patch in an ideal world.
>>>>>>> But then in an ideal world no barriers smap etc are necessary at all.
>>>>>> Again, this is only for metadata accessing not the data which has been used
>>>>>> for years for real use cases.
>>>>>>
>>>>>> For SMAP, it makes senses for the address that kernel can not forcast. But
>>>>>> it's not the case for the vhost metadata since we know the address will be
>>>>>> accessed very frequently. For speculation barrier, it helps nothing for the
>>>>>> data path of vhost which is a kthread.
>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>> step which makes some difference.
>>>> The problem is not kthread but the address of userspace address. The
>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>
>>>> Thanks.
>>> It's true for a real life applications but a malicious one
>>> can call the setup ioctls any number of times. And SMAP is
>>> all about malcious applications.
>>
>> We don't do this in the path of ioctl, there's no context switch between
>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>> from accessing userspace pages unexpectedly which is not the case for
>> metadata access.
>>
>> Thanks
> OK let's forget smap for now.

Some numbers I measured:

On an old Sandy bridge machine without SMAP support. Remove speculation 
barrier boost the performance from 4.6Mpps to 5.1Mpps

On a newer Broadwell machine with SMAP support. Remove speculation 
barrier only gives 2%-5% improvement, disable SMAP completely through 
Kconfig boost 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps 
- 6.1Mpps, it only bypass SMAP for metadata).

So it looks like for recent machine, SMAP becomes pain point when the 
copy is short (e.g 64B) for high PPS.

Thanks

>
>>>>>> Packet or AF_XDP benefit from
>>>>>> accessing metadata directly, we should do it as well.
>>>>>>
>>>>>> Thanks