[PATCH net-next 0/3] vhost: accelerate metadata access through vmap()

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
@ 2018-12-13 10:10 Jason Wang
  2018-12-13 10:10 ` [PATCH net-next 1/3] vhost: generalize adding used elem Jason Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel

Hi:

This series tries to access virtqueue metadata through kernel virtual
address instead of copy_user() friends since they had too much
overheads like checks, spec barriers or even hardware feature
toggling.

Test shows about 24% improvement on TX PPS. It should benefit other
cases as well.

Please review

Jason Wang (3):
  vhost: generalize adding used elem
  vhost: fine grain userspace memory accessors
  vhost: access vq metadata through kernel virtual address

 drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.h |  11 ++
 2 files changed, 266 insertions(+), 26 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH net-next 1/3] vhost: generalize adding used elem
  2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
@ 2018-12-13 10:10 ` Jason Wang
  2018-12-13 19:41   ` Michael S. Tsirkin
  2018-12-13 10:10 ` [PATCH net-next 2/3] vhost: fine grain userspace memory accessors Jason Wang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel

Use one generic vhost_copy_to_user() instead of two dedicated
accessor. This will simplify the conversion to fine grain accessors.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3a5f81a66d34..1c54ec1b82f8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2164,16 +2164,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
-	if (count == 1) {
-		if (vhost_put_user(vq, heads[0].id, &used->id)) {
-			vq_err(vq, "Failed to write used id");
-			return -EFAULT;
-		}
-		if (vhost_put_user(vq, heads[0].len, &used->len)) {
-			vq_err(vq, "Failed to write used len");
-			return -EFAULT;
-		}
-	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next 2/3] vhost: fine grain userspace memory accessors
  2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
  2018-12-13 10:10 ` [PATCH net-next 1/3] vhost: generalize adding used elem Jason Wang
@ 2018-12-13 10:10 ` Jason Wang
  2018-12-13 10:10 ` [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address Jason Wang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel

This is used to hide the metadata address from virtqueue helpers. This
will allow to implement a vmap based fast accessing to metadata.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 77 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1c54ec1b82f8..bafe39d2e637 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -871,6 +871,34 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 	ret; \
 })
 
+static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+			      vhost_avail_event(vq));
+}
+
+static inline int vhost_put_used(struct vhost_virtqueue *vq,
+				 struct vring_used_elem *head, int idx,
+				 int count)
+{
+	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
+				  count * sizeof(*head));
+}
+
+static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
+
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+			      &vq->used->flags);
+}
+
+static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
+
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+			      &vq->used->idx);
+}
+
 #define vhost_get_user(vq, x, ptr, type)		\
 ({ \
 	int ret; \
@@ -895,6 +923,43 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 #define vhost_get_used(vq, x, ptr) \
 	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
 
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
+				      __virtio16 *idx)
+{
+	return vhost_get_avail(vq, *idx, &vq->avail->idx);
+}
+
+static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
+				       __virtio16 *head, int idx)
+{
+	return vhost_get_avail(vq, *head,
+			       &vq->avail->ring[idx & (vq->num - 1)]);
+}
+
+static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
+					__virtio16 *flags)
+{
+	return vhost_get_avail(vq, *flags, &vq->avail->flags);
+}
+
+static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
+				       __virtio16 *event)
+{
+	return vhost_get_avail(vq, *event, vhost_used_event(vq));
+}
+
+static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
+				     __virtio16 *idx)
+{
+	return vhost_get_used(vq, *idx, &vq->used->idx);
+}
+
+static inline int vhost_get_desc(struct vhost_virtqueue *vq,
+				 struct vring_desc *desc, int idx)
+{
+	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
+}
+
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -1751,8 +1816,7 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
 	void __user *used;
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
-			   &vq->used->flags) < 0)
+	if (vhost_put_used_flags(vq))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		/* Make sure the flag is seen before log. */
@@ -1770,8 +1834,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
 {
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
-			   vhost_avail_event(vq)))
+	if (vhost_put_avail_event(vq))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		void __user *used;
@@ -1808,7 +1871,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
+	r = vhost_get_used_idx(vq, &last_used_idx);
 	if (r) {
 		vq_err(vq, "Can't access used idx at %p\n",
 		       &vq->used->idx);
@@ -2007,7 +2070,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	last_avail_idx = vq->last_avail_idx;
 
 	if (vq->avail_idx == vq->last_avail_idx) {
-		if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
 			vq_err(vq, "Failed to access avail idx at %p\n",
 				&vq->avail->idx);
 			return -EFAULT;
@@ -2034,8 +2097,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(vhost_get_avail(vq, ring_head,
-		     &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
 		       &vq->avail->ring[last_avail_idx % vq->num]);
@@ -2070,8 +2132,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
-					   sizeof desc);
+		ret = vhost_get_desc(vq, &desc, i);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -2164,7 +2225,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
-	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+	if (vhost_put_used(vq, heads, start, count)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
@@ -2208,8 +2269,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 
 	/* Make sure buffer is written before we update index. */
 	smp_wmb();
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
-			   &vq->used->idx)) {
+	if (vhost_put_used_idx(vq)) {
 		vq_err(vq, "Failed to increment used idx");
 		return -EFAULT;
 	}
@@ -2241,7 +2301,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
+		if (vhost_get_avail_flags(vq, &flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -2255,7 +2315,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
+	if (vhost_get_used_event(vq, &event)) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -2300,7 +2360,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (vq->avail_idx != vq->last_avail_idx)
 		return false;
 
-	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (unlikely(r))
 		return false;
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
@@ -2336,7 +2396,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
  2018-12-13 10:10 ` [PATCH net-next 1/3] vhost: generalize adding used elem Jason Wang
  2018-12-13 10:10 ` [PATCH net-next 2/3] vhost: fine grain userspace memory accessors Jason Wang
@ 2018-12-13 10:10 ` Jason Wang
  2018-12-13 15:44   ` Michael S. Tsirkin
  2018-12-14 14:48   ` kbuild test robot
  2018-12-13 15:27 ` [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Michael S. Tsirkin
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-13 10:10 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel

It was noticed that the copy_user() friends that was used to access
virtqueue metdata tends to be very expensive for dataplane
implementation like vhost since it involves lots of software check,
speculation barrier, hardware feature toggling (e.g SMAP). The
extra cost will be more obvious when transferring small packets.

This patch tries to eliminate those overhead by pin vq metadata pages
and access them through vmap(). During SET_VRING_ADDR, we will setup
those mappings and memory accessors are modified to use pointers to
access the metadata directly.

Note, this was only done when device IOTLB is not enabled. We could
use similar method to optimize it in the future.

Tests shows about ~24% improvement on TX PPS when using virtio-user +
vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):

Before: ~5.0Mpps
After:  ~6.1Mpps

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |  11 +++
 2 files changed, 189 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bafe39d2e637..1bd24203afb6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->indirect = NULL;
 		vq->heads = NULL;
 		vq->dev = dev;
+		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
+		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
+		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
@@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
 	spin_unlock(&dev->iotlb_lock);
 }
 
+static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
+			   size_t size, int write)
+{
+	struct page **pages;
+	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
+	int npinned;
+	void *vaddr;
+
+	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	npinned = get_user_pages_fast(uaddr, npages, write, pages);
+	if (npinned != npages)
+		goto err;
+
+	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+	if (!vaddr)
+		goto err;
+
+	map->pages = pages;
+	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
+	map->npages = npages;
+
+	return 0;
+
+err:
+	if (npinned > 0)
+		release_pages(pages, npinned);
+	kfree(pages);
+	return -EFAULT;
+}
+
+static void vhost_uninit_vmap(struct vhost_vmap *map)
+{
+	if (!map->addr)
+		return;
+
+	vunmap(map->addr);
+	release_pages(map->pages, map->npages);
+	kfree(map->pages);
+
+	map->addr = NULL;
+	map->pages = NULL;
+	map->npages = 0;
+}
+
+static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
+{
+	vhost_uninit_vmap(&vq->avail_ring);
+	vhost_uninit_vmap(&vq->desc_ring);
+	vhost_uninit_vmap(&vq->used_ring);
+}
+
+static int vhost_setup_vmaps(struct vhost_virtqueue *vq, unsigned long avail,
+			     unsigned long desc, unsigned long used)
+{
+	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+	size_t avail_size, desc_size, used_size;
+	int ret;
+
+	vhost_clean_vmaps(vq);
+
+	avail_size = sizeof(*vq->avail) +
+		     sizeof(*vq->avail->ring) * vq->num + event;
+	ret = vhost_init_vmap(&vq->avail_ring, avail, avail_size, false);
+	if (ret) {
+		vq_err(vq, "Fail to setup vmap for avail ring!\n");
+		goto err_avail;
+	}
+
+	desc_size = sizeof(*vq->desc) * vq->num;
+	ret = vhost_init_vmap(&vq->desc_ring, desc, desc_size, false);
+	if (ret) {
+		vq_err(vq, "Fail to setup vmap for desc ring!\n");
+		goto err_desc;
+	}
+
+	used_size = sizeof(*vq->used) +
+		    sizeof(*vq->used->ring) * vq->num + event;
+	ret = vhost_init_vmap(&vq->used_ring, used, used_size, true);
+	if (ret) {
+		vq_err(vq, "Fail to setup vmap for used ring!\n");
+		goto err_used;
+	}
+
+	return 0;
+
+err_used:
+	vhost_uninit_vmap(&vq->used_ring);
+err_desc:
+	vhost_uninit_vmap(&vq->avail_ring);
+err_avail:
+	return -EFAULT;
+}
+
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
@@ -626,6 +725,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 		if (dev->vqs[i]->call_ctx)
 			eventfd_ctx_put(dev->vqs[i]->call_ctx);
 		vhost_vq_reset(dev, dev->vqs[i]);
+		vhost_clean_vmaps(dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
 	if (dev->log_ctx)
@@ -873,6 +973,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 
 static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		*((__virtio16 *)&used->ring[vq->num]) =
+			cpu_to_vhost16(vq, vq->avail_idx);
+		return 0;
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
 			      vhost_avail_event(vq));
 }
@@ -881,6 +989,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 				 struct vring_used_elem *head, int idx,
 				 int count)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		memcpy(used->ring + idx, head, count * sizeof(*head));
+		return 0;
+	}
+
 	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
 				  count * sizeof(*head));
 }
@@ -888,6 +1003,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		used->flags = cpu_to_vhost16(vq, vq->used_flags);
+		return 0;
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
 			      &vq->used->flags);
 }
@@ -895,6 +1017,13 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
+		return 0;
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
 			      &vq->used->idx);
 }
@@ -926,12 +1055,26 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 				      __virtio16 *idx)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		*idx = avail->idx;
+		return 0;
+	}
+
 	return vhost_get_avail(vq, *idx, &vq->avail->idx);
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 				       __virtio16 *head, int idx)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		*head = avail->ring[idx & (vq->num - 1)];
+		return 0;
+	}
+
 	return vhost_get_avail(vq, *head,
 			       &vq->avail->ring[idx & (vq->num - 1)]);
 }
@@ -939,24 +1082,52 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 					__virtio16 *flags)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		*flags = avail->flags;
+		return 0;
+	}
+
 	return vhost_get_avail(vq, *flags, &vq->avail->flags);
 }
 
 static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 				       __virtio16 *event)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		*event = (__virtio16)avail->ring[vq->num];
+		return 0;
+	}
+
 	return vhost_get_avail(vq, *event, vhost_used_event(vq));
 }
 
 static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 				     __virtio16 *idx)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		*idx = used->idx;
+		return 0;
+	}
+
 	return vhost_get_used(vq, *idx, &vq->used->idx);
 }
 
 static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 				 struct vring_desc *desc, int idx)
 {
+	if (!vq->iotlb) {
+		struct vring_desc *d = vq->desc_ring.addr;
+
+		*desc = *(d + idx);
+		return 0;
+	}
+
 	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
 }
 
@@ -1551,6 +1722,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 			}
 		}
 
+		if (!vq->iotlb && vhost_setup_vmaps(vq, a.avail_user_addr,
+							a.desc_user_addr,
+							a.used_user_addr)) {
+			r = -EINVAL;
+			break;
+		}
+
 		vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
 		vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
 		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 466ef7542291..89dc0ad3d055 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -80,6 +80,12 @@ enum vhost_uaddr_type {
 	VHOST_NUM_ADDRS = 3,
 };
 
+struct vhost_vmap {
+	struct page **pages;
+	void *addr;
+	int npages;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -90,6 +96,11 @@ struct vhost_virtqueue {
 	struct vring_desc __user *desc;
 	struct vring_avail __user *avail;
 	struct vring_used __user *used;
+
+	struct vhost_vmap avail_ring;
+	struct vhost_vmap desc_ring;
+	struct vhost_vmap used_ring;
+
 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
 	struct file *kick;
 	struct eventfd_ctx *call_ctx;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
                   ` (2 preceding siblings ...)
  2018-12-13 10:10 ` [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address Jason Wang
@ 2018-12-13 15:27 ` Michael S. Tsirkin
  2018-12-14  3:42   ` Jason Wang
  2018-12-13 20:12 ` Michael S. Tsirkin
  2018-12-14 15:16 ` Michael S. Tsirkin
  5 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-13 15:27 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.

Userspace accesses through remapping tricks and next time there's a need
for a new barrier we are left to figure it out by ourselves.  I don't
like the idea I have to say.  As a first step, why don't we switch to
unsafe_put_user/unsafe_get_user etc?
That would be more of an apples to apples comparison, would it not?


> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
> 
> Please review
> 
> Jason Wang (3):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h |  11 ++
>  2 files changed, 266 insertions(+), 26 deletions(-)
> 
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-13 10:10 ` [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address Jason Wang
@ 2018-12-13 15:44   ` Michael S. Tsirkin
  2018-12-13 21:18     ` Konrad Rzeszutek Wilk
  2018-12-14  3:57     ` Jason Wang
  2018-12-14 14:48   ` kbuild test robot
  1 sibling, 2 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-13 15:44 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> It was noticed that the copy_user() friends that was used to access
> virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software check,
> speculation barrier, hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets.
> 
> This patch tries to eliminate those overhead by pin vq metadata pages
> and access them through vmap(). During SET_VRING_ADDR, we will setup
> those mappings and memory accessors are modified to use pointers to
> access the metadata directly.
> 
> Note, this was only done when device IOTLB is not enabled. We could
> use similar method to optimize it in the future.
> 
> Tests shows about ~24% improvement on TX PPS when using virtio-user +
> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> 
> Before: ~5.0Mpps
> After:  ~6.1Mpps
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |  11 +++
>  2 files changed, 189 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bafe39d2e637..1bd24203afb6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>  		mutex_init(&vq->mutex);
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>  	spin_unlock(&dev->iotlb_lock);
>  }
>  
> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> +			   size_t size, int write)
> +{
> +	struct page **pages;
> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> +	int npinned;
> +	void *vaddr;
> +
> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> +	if (npinned != npages)
> +		goto err;
> +

As I said I have doubts about the whole approach, but this
implementation in particular isn't a good idea
as it keeps the page around forever.
So no THP, no NUMA rebalancing, userspace-controlled
amount of memory locked up and not accounted for.

Don't get me wrong it's a great patch in an ideal world.
But then in an ideal world no barriers smap etc are necessary at all.


> +	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> +	if (!vaddr)
> +		goto err;
> +
> +	map->pages = pages;
> +	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
> +	map->npages = npages;
> +
> +	return 0;
> +
> +err:
> +	if (npinned > 0)
> +		release_pages(pages, npinned);
> +	kfree(pages);
> +	return -EFAULT;
> +}
> +
> +static void vhost_uninit_vmap(struct vhost_vmap *map)
> +{
> +	if (!map->addr)
> +		return;
> +
> +	vunmap(map->addr);
> +	release_pages(map->pages, map->npages);
> +	kfree(map->pages);
> +
> +	map->addr = NULL;
> +	map->pages = NULL;
> +	map->npages = 0;
> +}
> +
> +static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
> +{
> +	vhost_uninit_vmap(&vq->avail_ring);
> +	vhost_uninit_vmap(&vq->desc_ring);
> +	vhost_uninit_vmap(&vq->used_ring);
> +}
> +
> +static int vhost_setup_vmaps(struct vhost_virtqueue *vq, unsigned long avail,
> +			     unsigned long desc, unsigned long used)
> +{
> +	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +	size_t avail_size, desc_size, used_size;
> +	int ret;
> +
> +	vhost_clean_vmaps(vq);
> +
> +	avail_size = sizeof(*vq->avail) +
> +		     sizeof(*vq->avail->ring) * vq->num + event;
> +	ret = vhost_init_vmap(&vq->avail_ring, avail, avail_size, false);
> +	if (ret) {
> +		vq_err(vq, "Fail to setup vmap for avail ring!\n");
> +		goto err_avail;
> +	}
> +
> +	desc_size = sizeof(*vq->desc) * vq->num;
> +	ret = vhost_init_vmap(&vq->desc_ring, desc, desc_size, false);
> +	if (ret) {
> +		vq_err(vq, "Fail to setup vmap for desc ring!\n");
> +		goto err_desc;
> +	}
> +
> +	used_size = sizeof(*vq->used) +
> +		    sizeof(*vq->used->ring) * vq->num + event;
> +	ret = vhost_init_vmap(&vq->used_ring, used, used_size, true);
> +	if (ret) {
> +		vq_err(vq, "Fail to setup vmap for used ring!\n");
> +		goto err_used;
> +	}
> +
> +	return 0;
> +
> +err_used:
> +	vhost_uninit_vmap(&vq->used_ring);
> +err_desc:
> +	vhost_uninit_vmap(&vq->avail_ring);
> +err_avail:
> +	return -EFAULT;
> +}
> +
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> @@ -626,6 +725,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  		if (dev->vqs[i]->call_ctx)
>  			eventfd_ctx_put(dev->vqs[i]->call_ctx);
>  		vhost_vq_reset(dev, dev->vqs[i]);
> +		vhost_clean_vmaps(dev->vqs[i]);
>  	}
>  	vhost_dev_free_iovecs(dev);
>  	if (dev->log_ctx)
> @@ -873,6 +973,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		*((__virtio16 *)&used->ring[vq->num]) =
> +			cpu_to_vhost16(vq, vq->avail_idx);
> +		return 0;
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>  			      vhost_avail_event(vq));
>  }
> @@ -881,6 +989,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  				 struct vring_used_elem *head, int idx,
>  				 int count)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		memcpy(used->ring + idx, head, count * sizeof(*head));
> +		return 0;
> +	}
> +
>  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>  				  count * sizeof(*head));
>  }
> @@ -888,6 +1003,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		used->flags = cpu_to_vhost16(vq, vq->used_flags);
> +		return 0;
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>  			      &vq->used->flags);
>  }
> @@ -895,6 +1017,13 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> +		return 0;
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>  			      &vq->used->idx);
>  }
> @@ -926,12 +1055,26 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  				      __virtio16 *idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		*idx = avail->idx;
> +		return 0;
> +	}
> +
>  	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  				       __virtio16 *head, int idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		*head = avail->ring[idx & (vq->num - 1)];
> +		return 0;
> +	}
> +
>  	return vhost_get_avail(vq, *head,
>  			       &vq->avail->ring[idx & (vq->num - 1)]);
>  }
> @@ -939,24 +1082,52 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  					__virtio16 *flags)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		*flags = avail->flags;
> +		return 0;
> +	}
> +
>  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>  }
>  
>  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  				       __virtio16 *event)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		*event = (__virtio16)avail->ring[vq->num];
> +		return 0;
> +	}
> +
>  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>  }
>  
>  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  				     __virtio16 *idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		*idx = used->idx;
> +		return 0;
> +	}
> +
>  	return vhost_get_used(vq, *idx, &vq->used->idx);
>  }
>  
>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  				 struct vring_desc *desc, int idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_desc *d = vq->desc_ring.addr;
> +
> +		*desc = *(d + idx);
> +		return 0;
> +	}
> +
>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>  }
>  
> @@ -1551,6 +1722,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  			}
>  		}
>  
> +		if (!vq->iotlb && vhost_setup_vmaps(vq, a.avail_user_addr,
> +							a.desc_user_addr,
> +							a.used_user_addr)) {
> +			r = -EINVAL;
> +			break;
> +		}
> +
>  		vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
>  		vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
>  		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 466ef7542291..89dc0ad3d055 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -80,6 +80,12 @@ enum vhost_uaddr_type {
>  	VHOST_NUM_ADDRS = 3,
>  };
>  
> +struct vhost_vmap {
> +	struct page **pages;
> +	void *addr;
> +	int npages;
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -90,6 +96,11 @@ struct vhost_virtqueue {
>  	struct vring_desc __user *desc;
>  	struct vring_avail __user *avail;
>  	struct vring_used __user *used;
> +
> +	struct vhost_vmap avail_ring;
> +	struct vhost_vmap desc_ring;
> +	struct vhost_vmap used_ring;
> +
>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>  	struct file *kick;
>  	struct eventfd_ctx *call_ctx;
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 1/3] vhost: generalize adding used elem
  2018-12-13 10:10 ` [PATCH net-next 1/3] vhost: generalize adding used elem Jason Wang
@ 2018-12-13 19:41   ` Michael S. Tsirkin
  2018-12-14  4:00     ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-13 19:41 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Thu, Dec 13, 2018 at 06:10:20PM +0800, Jason Wang wrote:
> Use one generic vhost_copy_to_user() instead of two dedicated
> accessor. This will simplify the conversion to fine grain accessors.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>


The reason we did it like this is because it was faster.
Want to try benchmarking before we change it?

> ---
>  drivers/vhost/vhost.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 3a5f81a66d34..1c54ec1b82f8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2164,16 +2164,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>  
>  	start = vq->last_used_idx & (vq->num - 1);
>  	used = vq->used->ring + start;
> -	if (count == 1) {
> -		if (vhost_put_user(vq, heads[0].id, &used->id)) {
> -			vq_err(vq, "Failed to write used id");
> -			return -EFAULT;
> -		}
> -		if (vhost_put_user(vq, heads[0].len, &used->len)) {
> -			vq_err(vq, "Failed to write used len");
> -			return -EFAULT;
> -		}
> -	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> +	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
>  		vq_err(vq, "Failed to write used");
>  		return -EFAULT;
>  	}
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
                   ` (3 preceding siblings ...)
  2018-12-13 15:27 ` [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Michael S. Tsirkin
@ 2018-12-13 20:12 ` Michael S. Tsirkin
  2018-12-14  4:29   ` Jason Wang
  2018-12-14 15:16 ` Michael S. Tsirkin
  5 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-13 20:12 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
> 
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
> 
> Please review

I think the idea of speeding up userspace access is a good one.
However I think that moving all checks to start is way too aggressive.
Instead, let's batch things up but let's not keep them
around forever.
Here are some ideas:

1. Disable preemption, process a small number of small packets
   directly in an atomic context. This should cut latency
   down significantly, the tricky part is to only do it
   on a light load and disable this
   for the streaming case otherwise it's unfair.
   This might fail, if it does just bounce things out to
   a thread.

2. Switch to unsafe_put_user/unsafe_get_user,
   and batch up multiple accesses.

3. Allow adding a fixup point manually,
   such that multiple independent get_user accesses
   can get a single fixup (will allow better compiler
   optimizations).

> Jason Wang (3):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h |  11 ++
>  2 files changed, 266 insertions(+), 26 deletions(-)
> 
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-13 15:44   ` Michael S. Tsirkin
@ 2018-12-13 21:18     ` Konrad Rzeszutek Wilk
  2018-12-13 21:58       ` Michael S. Tsirkin
  2018-12-14  3:57     ` Jason Wang
  1 sibling, 1 reply; 39+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-12-13 21:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel

.giant snip..
> > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > +	if (npinned != npages)
> > +		goto err;
> > +
> 
> As I said I have doubts about the whole approach, but this
> implementation in particular isn't a good idea
> as it keeps the page around forever.
> So no THP, no NUMA rebalancing, userspace-controlled
> amount of memory locked up and not accounted for.
> 
> Don't get me wrong it's a great patch in an ideal world.
> But then in an ideal world no barriers smap etc are necessary at all.

So .. suggestions on how this could be accepted? As in other ways
where we still get vmap and the issues you mentioned are not troubling you?

Thanks!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-13 21:18     ` Konrad Rzeszutek Wilk
@ 2018-12-13 21:58       ` Michael S. Tsirkin
  0 siblings, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-13 21:58 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel

On Thu, Dec 13, 2018 at 04:18:40PM -0500, Konrad Rzeszutek Wilk wrote:
> .giant snip..
> > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > +	if (npinned != npages)
> > > +		goto err;
> > > +
> > 
> > As I said I have doubts about the whole approach, but this
> > implementation in particular isn't a good idea
> > as it keeps the page around forever.
> > So no THP, no NUMA rebalancing, userspace-controlled
> > amount of memory locked up and not accounted for.
> > 
> > Don't get me wrong it's a great patch in an ideal world.
> > But then in an ideal world no barriers smap etc are necessary at all.
> 
> So .. suggestions on how this could be accepted? As in other ways
> where we still get vmap and the issues you mentioned are not troubling you?
> 
> Thanks!

I'd suggest leave vmap alone and find ways to speed up accesses
that can fault.

-- 
MST

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-13 15:27 ` [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Michael S. Tsirkin
@ 2018-12-14  3:42   ` Jason Wang
  2018-12-14 12:33     ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-14  3:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
> Userspace accesses through remapping tricks and next time there's a need
> for a new barrier we are left to figure it out by ourselves.


I don't get here, do you mean spec barriers? It's completely unnecessary 
for vhost which is kernel thread. And even if you're right, vhost is not 
the only place, there's lots of vmap() based accessing in kernel. Think 
in another direction, this means we won't suffer form unnecessary 
barriers for kthread like vhost in the future, we will manually pick the 
one we really need (but it should have little possibility).

Please notice we only access metdata through remapping not the data 
itself. This idea has been used for high speed userspace backend for 
years, e.g packet socket or recent AF_XDP. The only difference is the 
page was remap to from kernel to userspace.


>    I don't
> like the idea I have to say.  As a first step, why don't we switch to
> unsafe_put_user/unsafe_get_user etc?


Several reasons:

- They only have x86 variant, it won't have any difference for the rest 
of architecture.

- unsafe_put_user/unsafe_get_user is not sufficient for accessing 
structures (e.g accessing descriptor) or arrays (batching).

- Unless we can batch at least the accessing of two places in three of 
avail, used and descriptor in one run. There will be no difference. E.g 
we can batch updating used ring, but it won't make any difference in 
this case.


> That would be more of an apples to apples comparison, would it not?


Apples to apples comparison only help if we are the No.1. But the fact 
is we are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is 
the fastest method AFAIK.


Thanks


>
>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
>>
>> Jason Wang (3):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>   drivers/vhost/vhost.h |  11 ++
>>   2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> -- 
>> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-13 15:44   ` Michael S. Tsirkin
  2018-12-13 21:18     ` Konrad Rzeszutek Wilk
@ 2018-12-14  3:57     ` Jason Wang
  2018-12-14 12:36       ` Michael S. Tsirkin
  2018-12-15 21:15       ` David Miller
  1 sibling, 2 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-14  3:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>> It was noticed that the copy_user() friends that was used to access
>> virtqueue metdata tends to be very expensive for dataplane
>> implementation like vhost since it involves lots of software check,
>> speculation barrier, hardware feature toggling (e.g SMAP). The
>> extra cost will be more obvious when transferring small packets.
>>
>> This patch tries to eliminate those overhead by pin vq metadata pages
>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>> those mappings and memory accessors are modified to use pointers to
>> access the metadata directly.
>>
>> Note, this was only done when device IOTLB is not enabled. We could
>> use similar method to optimize it in the future.
>>
>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>
>> Before: ~5.0Mpps
>> After:  ~6.1Mpps
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>   drivers/vhost/vhost.h |  11 +++
>>   2 files changed, 189 insertions(+)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index bafe39d2e637..1bd24203afb6 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>   		vq->indirect = NULL;
>>   		vq->heads = NULL;
>>   		vq->dev = dev;
>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>   		mutex_init(&vq->mutex);
>>   		vhost_vq_reset(dev, vq);
>>   		if (vq->handle_kick)
>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>   	spin_unlock(&dev->iotlb_lock);
>>   }
>>   
>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>> +			   size_t size, int write)
>> +{
>> +	struct page **pages;
>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>> +	int npinned;
>> +	void *vaddr;
>> +
>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>> +	if (npinned != npages)
>> +		goto err;
>> +
> As I said I have doubts about the whole approach, but this
> implementation in particular isn't a good idea
> as it keeps the page around forever.
> So no THP, no NUMA rebalancing,


This is the price of all GUP users not only vhost itself. What's more 
important, the goal is not to be left too much behind for other backends 
like DPDK or AF_XDP (all of which are using GUP).


> userspace-controlled
> amount of memory locked up and not accounted for.


It's pretty easy to add this since the slow path was still kept. If we 
exceeds the limitation, we can switch back to slow path.


>
> Don't get me wrong it's a great patch in an ideal world.
> But then in an ideal world no barriers smap etc are necessary at all.


Again, this is only for metadata accessing not the data which has been 
used for years for real use cases.

For SMAP, it makes senses for the address that kernel can not forcast. 
But it's not the case for the vhost metadata since we know the address 
will be accessed very frequently. For speculation barrier, it helps 
nothing for the data path of vhost which is a kthread. Packet or AF_XDP 
benefit from accessing metadata directly, we should do it as well.

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 1/3] vhost: generalize adding used elem
  2018-12-13 19:41   ` Michael S. Tsirkin
@ 2018-12-14  4:00     ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-14  4:00 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/14 上午3:41, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:20PM +0800, Jason Wang wrote:
>> Use one generic vhost_copy_to_user() instead of two dedicated
>> accessor. This will simplify the conversion to fine grain accessors.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> The reason we did it like this is because it was faster.
> Want to try benchmarking before we change it?


Faster is a good side effect of such conversion. Otherwise we need two 
new fine grain memory accessor.

But, anyway, I can test its impact.

Thanks


>
>> ---
>>   drivers/vhost/vhost.c | 11 +----------
>>   1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 3a5f81a66d34..1c54ec1b82f8 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -2164,16 +2164,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>>   
>>   	start = vq->last_used_idx & (vq->num - 1);
>>   	used = vq->used->ring + start;
>> -	if (count == 1) {
>> -		if (vhost_put_user(vq, heads[0].id, &used->id)) {
>> -			vq_err(vq, "Failed to write used id");
>> -			return -EFAULT;
>> -		}
>> -		if (vhost_put_user(vq, heads[0].len, &used->len)) {
>> -			vq_err(vq, "Failed to write used len");
>> -			return -EFAULT;
>> -		}
>> -	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
>> +	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
>>   		vq_err(vq, "Failed to write used");
>>   		return -EFAULT;
>>   	}
>> -- 
>> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-13 20:12 ` Michael S. Tsirkin
@ 2018-12-14  4:29   ` Jason Wang
  2018-12-14 12:52     ` Michael S. Tsirkin
  2018-12-15 19:43     ` David Miller
  0 siblings, 2 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-14  4:29 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel

On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
>>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
> I think the idea of speeding up userspace access is a good one.
> However I think that moving all checks to start is way too aggressive.

So did packet and AF_XDP. Anyway, sharing address space and access them 
directly is the fastest way. Performance is the major consideration for 
people to choose backend. Compare to userspace implementation, vhost 
does not have security advantages at any level. If vhost is still slow, 
people will start to develop backends based on e.g AF_XDP.

> Instead, let's batch things up but let's not keep them
> around forever.
> Here are some ideas:
>
>
> 1. Disable preemption, process a small number of small packets
>     directly in an atomic context. This should cut latency
>     down significantly, the tricky part is to only do it
>     on a light load and disable this
>     for the streaming case otherwise it's unfair.
>     This might fail, if it does just bounce things out to
>     a thread.

I'm not sure what context you meant here. Is this for TX path of TUN? 
But a fundamental difference is my series is targeted for extreme heavy 
load not light one, 100% cpu for vhost is expected.

>
> 2. Switch to unsafe_put_user/unsafe_get_user,
>     and batch up multiple accesses.

As I said, unless we can batch accessing of two difference places of 
three of avail, descriptor and used. It won't help for batching the 
accessing of a single place like used. I'm even not sure this can be 
done consider the case of packed virtqueue, we have a single descriptor 
ring. Batching through unsafe helpers may not help in this case since 
it's equivalent to safe ones . And This requires non trivial refactoring 
of vhost. And such refactoring itself make give us noticeable impact 
(e.g it may lead regression).

>
> 3. Allow adding a fixup point manually,
>     such that multiple independent get_user accesses
>     can get a single fixup (will allow better compiler
>     optimizations).
>

So for metadata access, I don't see how you suggest here can help in the 
case of heavy workload.

For data access, this may help but I've played to batch the data copy to 
reduce SMAP/spec barriers in vhost-net but I don't see performance 
improvement.

Thanks

>
>
>
>> Jason Wang (3):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>   drivers/vhost/vhost.h |  11 ++
>>   2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> -- 
>> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-14  3:42   ` Jason Wang
@ 2018-12-14 12:33     ` Michael S. Tsirkin
  2018-12-14 15:31       ` Michael S. Tsirkin
  2018-12-24  8:32       ` Jason Wang
  0 siblings, 2 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-14 12:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> 
> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > Userspace accesses through remapping tricks and next time there's a need
> > for a new barrier we are left to figure it out by ourselves.
> 
> 
> I don't get here, do you mean spec barriers?

I mean the next barrier people decide to put into userspace
memory accesses.

> It's completely unnecessary for
> vhost which is kernel thread.

It's defence in depth. Take a look at the commit that added them.
And yes quite possibly in most cases we actually have a spec
barrier in the validation phase. If we do let's use the
unsafe variants so they can be found.

> And even if you're right, vhost is not the
> only place, there's lots of vmap() based accessing in kernel.

For sure. But if one can get by without get user pages, one
really should. Witness recently uncovered mess with file
backed storage.

> Think in
> another direction, this means we won't suffer form unnecessary barriers for
> kthread like vhost in the future, we will manually pick the one we really
> need

I personally think we should err on the side of caution not on the side of
performance.

> (but it should have little possibility).

History seems to teach otherwise.

> Please notice we only access metdata through remapping not the data itself.
> This idea has been used for high speed userspace backend for years, e.g
> packet socket or recent AF_XDP.

I think their justification for the higher risk is that they are mostly
designed for priveledged userspace.

> The only difference is the page was remap to
> from kernel to userspace.

At least that avoids the g.u.p mess.

> 
> >    I don't
> > like the idea I have to say.  As a first step, why don't we switch to
> > unsafe_put_user/unsafe_get_user etc?
> 
> 
> Several reasons:
> 
> - They only have x86 variant, it won't have any difference for the rest of
> architecture.

Is there an issue on other architectures? If yes they can be extended
there.

> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> (e.g accessing descriptor) or arrays (batching).

So you want unsafe_copy_xxx_user? I can do this. Hang on will post.

> - Unless we can batch at least the accessing of two places in three of
> avail, used and descriptor in one run. There will be no difference. E.g we
> can batch updating used ring, but it won't make any difference in this case.
> 

So let's batch them all?


> > That would be more of an apples to apples comparison, would it not?
> 
> 
> Apples to apples comparison only help if we are the No.1. But the fact is we
> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
> fastest method AFAIK.
> 
> 
> Thanks

We need to speed up the packet access itself too though.
You can't vmap all of guest memory.


> 
> > 
> > 
> > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Please review
> > > 
> > > Jason Wang (3):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > >   drivers/vhost/vhost.h |  11 ++
> > >   2 files changed, 266 insertions(+), 26 deletions(-)
> > > 
> > > -- 
> > > 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-14  3:57     ` Jason Wang
@ 2018-12-14 12:36       ` Michael S. Tsirkin
  2018-12-24  7:53         ` Jason Wang
  2018-12-15 21:15       ` David Miller
  1 sibling, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-14 12:36 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> 
> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > It was noticed that the copy_user() friends that was used to access
> > > virtqueue metdata tends to be very expensive for dataplane
> > > implementation like vhost since it involves lots of software check,
> > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > extra cost will be more obvious when transferring small packets.
> > > 
> > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > those mappings and memory accessors are modified to use pointers to
> > > access the metadata directly.
> > > 
> > > Note, this was only done when device IOTLB is not enabled. We could
> > > use similar method to optimize it in the future.
> > > 
> > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > 
> > > Before: ~5.0Mpps
> > > After:  ~6.1Mpps
> > > 
> > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > ---
> > >   drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > >   drivers/vhost/vhost.h |  11 +++
> > >   2 files changed, 189 insertions(+)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index bafe39d2e637..1bd24203afb6 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >   		vq->indirect = NULL;
> > >   		vq->heads = NULL;
> > >   		vq->dev = dev;
> > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > >   		mutex_init(&vq->mutex);
> > >   		vhost_vq_reset(dev, vq);
> > >   		if (vq->handle_kick)
> > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > >   	spin_unlock(&dev->iotlb_lock);
> > >   }
> > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > +			   size_t size, int write)
> > > +{
> > > +	struct page **pages;
> > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > +	int npinned;
> > > +	void *vaddr;
> > > +
> > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > +	if (!pages)
> > > +		return -ENOMEM;
> > > +
> > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > +	if (npinned != npages)
> > > +		goto err;
> > > +
> > As I said I have doubts about the whole approach, but this
> > implementation in particular isn't a good idea
> > as it keeps the page around forever.
> > So no THP, no NUMA rebalancing,
> 
> 
> This is the price of all GUP users not only vhost itself.

Yes. GUP is just not a great interface for vhost to use.

> What's more
> important, the goal is not to be left too much behind for other backends
> like DPDK or AF_XDP (all of which are using GUP).


So these guys assume userspace knows what it's doing.
We can't assume that.

> 
> > userspace-controlled
> > amount of memory locked up and not accounted for.
> 
> 
> It's pretty easy to add this since the slow path was still kept. If we
> exceeds the limitation, we can switch back to slow path.
> 
> > 
> > Don't get me wrong it's a great patch in an ideal world.
> > But then in an ideal world no barriers smap etc are necessary at all.
> 
> 
> Again, this is only for metadata accessing not the data which has been used
> for years for real use cases.
> 
> For SMAP, it makes senses for the address that kernel can not forcast. But
> it's not the case for the vhost metadata since we know the address will be
> accessed very frequently. For speculation barrier, it helps nothing for the
> data path of vhost which is a kthread.

I don't see how a kthread makes any difference. We do have a validation
step which makes some difference.

> Packet or AF_XDP benefit from
> accessing metadata directly, we should do it as well.
> 
> Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-14  4:29   ` Jason Wang
@ 2018-12-14 12:52     ` Michael S. Tsirkin
  2018-12-15 19:43     ` David Miller
  1 sibling, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-14 12:52 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Fri, Dec 14, 2018 at 12:29:54PM +0800, Jason Wang wrote:
> 
> On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > > 
> > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Please review
> > I think the idea of speeding up userspace access is a good one.
> > However I think that moving all checks to start is way too aggressive.
> 
> 
> So did packet and AF_XDP. Anyway, sharing address space and access them
> directly is the fastest way. Performance is the major consideration for
> people to choose backend. Compare to userspace implementation, vhost does
> not have security advantages at any level. If vhost is still slow, people
> will start to develop backends based on e.g AF_XDP.
> 

Let them what's wrong with that?

> > Instead, let's batch things up but let's not keep them
> > around forever.
> > Here are some ideas:
> > 
> > 
> > 1. Disable preemption, process a small number of small packets
> >     directly in an atomic context. This should cut latency
> >     down significantly, the tricky part is to only do it
> >     on a light load and disable this
> >     for the streaming case otherwise it's unfair.
> >     This might fail, if it does just bounce things out to
> >     a thread.
> 
> 
> I'm not sure what context you meant here. Is this for TX path of TUN? But a
> fundamental difference is my series is targeted for extreme heavy load not
> light one, 100% cpu for vhost is expected.

Interesting. You only shared a TCP RR result though.
What's the performance gain in a heavy load case?

> 
> > 
> > 2. Switch to unsafe_put_user/unsafe_get_user,
> >     and batch up multiple accesses.
> 
> 
> As I said, unless we can batch accessing of two difference places of three
> of avail, descriptor and used. It won't help for batching the accessing of a
> single place like used. I'm even not sure this can be done consider the case
> of packed virtqueue, we have a single descriptor ring.

So that's one of the reasons packed should be faster. Single access
and you get the descriptor no messy redirects. Somehow your
benchmarking so far didn't show a gain with vhost and
packed though - do you know what's wrong?

> Batching through
> unsafe helpers may not help in this case since it's equivalent to safe ones
> . And This requires non trivial refactoring of vhost. And such refactoring
> itself make give us noticeable impact (e.g it may lead regression).
> 
> 
> > 
> > 3. Allow adding a fixup point manually,
> >     such that multiple independent get_user accesses
> >     can get a single fixup (will allow better compiler
> >     optimizations).
> > 
> 
> So for metadata access, I don't see how you suggest here can help in the
> case of heavy workload.
> 
> For data access, this may help but I've played to batch the data copy to
> reduce SMAP/spec barriers in vhost-net but I don't see performance
> improvement.
> 
> Thanks

So how about we try to figure what's going on actually?
Can you drop the barriers and show the same gain?
E.g. vmap does not use a huge page IIRC so in fact it
can be slower than direct access. It's not a magic
faster way.



> 
> > 
> > 
> > 
> > > Jason Wang (3):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > >   drivers/vhost/vhost.h |  11 ++
> > >   2 files changed, 266 insertions(+), 26 deletions(-)
> > > 
> > > -- 
> > > 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-13 10:10 ` [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address Jason Wang
  2018-12-13 15:44   ` Michael S. Tsirkin
@ 2018-12-14 14:48   ` kbuild test robot
  1 sibling, 0 replies; 39+ messages in thread
From: kbuild test robot @ 2018-12-14 14:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: kbuild-all, mst, jasowang, kvm, virtualization, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2071 bytes --]

Hi Jason,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jason-Wang/vhost-accelerate-metadata-access-through-vmap/20181214-200417
config: mips-malta_kvm_defconfig (attached as .config)
compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=mips 

All errors (new ones prefixed by >>):

   drivers//vhost/vhost.c: In function 'vhost_init_vmap':
>> drivers//vhost/vhost.c:648:3: error: implicit declaration of function 'release_pages'; did you mean 'release_task'? [-Werror=implicit-function-declaration]
      release_pages(pages, npinned);
      ^~~~~~~~~~~~~
      release_task
   cc1: some warnings being treated as errors

vim +648 drivers//vhost/vhost.c

   619	
   620	static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
   621				   size_t size, int write)
   622	{
   623		struct page **pages;
   624		int npages = DIV_ROUND_UP(size, PAGE_SIZE);
   625		int npinned;
   626		void *vaddr;
   627	
   628		pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
   629		if (!pages)
   630			return -ENOMEM;
   631	
   632		npinned = get_user_pages_fast(uaddr, npages, write, pages);
   633		if (npinned != npages)
   634			goto err;
   635	
   636		vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
   637		if (!vaddr)
   638			goto err;
   639	
   640		map->pages = pages;
   641		map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
   642		map->npages = npages;
   643	
   644		return 0;
   645	
   646	err:
   647		if (npinned > 0)
 > 648			release_pages(pages, npinned);
   649		kfree(pages);
   650		return -EFAULT;
   651	}
   652	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 19468 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
                   ` (4 preceding siblings ...)
  2018-12-13 20:12 ` Michael S. Tsirkin
@ 2018-12-14 15:16 ` Michael S. Tsirkin
  5 siblings, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-14 15:16 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
> 
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.

BTW if the issue is all the error checking, maybe we should
consider using something like uaccess_catch and check
in a single place.


> Please review
> 
> Jason Wang (3):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h |  11 ++
>  2 files changed, 266 insertions(+), 26 deletions(-)
> 
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-14 12:33     ` Michael S. Tsirkin
@ 2018-12-14 15:31       ` Michael S. Tsirkin
  2018-12-24  8:32       ` Jason Wang
  1 sibling, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-14 15:31 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Fri, Dec 14, 2018 at 07:33:20AM -0500, Michael S. Tsirkin wrote:
> > - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> > (e.g accessing descriptor) or arrays (batching).
> 
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.

Like this basically? Warning: completely untested.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index b5e58cc0c5e7..d2afd70793ca 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -728,5 +728,10 @@ do {										\
 	if (unlikely(__gu_err)) goto err_label;					\
 } while (0)
 
+#define unsafe_copy_from_user(to, from, n)					\
+	 __copy_user_ll(to, (__force const void *)from, n)
+#define unsafe_copy_to_user(to, from, n)					\
+	 __copy_user_ll((__force void *)to, from, n)
+
 #endif /* _ASM_X86_UACCESS_H */
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index efe79c1cdd47..b9d515ba2920 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -271,6 +271,8 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 #define user_access_end() do { } while (0)
 #define unsafe_get_user(x, ptr, err) do { if (unlikely(__get_user(x, ptr))) goto err; } while (0)
 #define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0)
+#define unsafe_copy_to_user(from, to, n) copy_to_user(from, to, n)
+#define unsafe_copy_from_user(from, to, n) copy_from_user(from, to, n)
 #endif
 
 #ifdef CONFIG_HARDENED_USERCOPY

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-14  4:29   ` Jason Wang
  2018-12-14 12:52     ` Michael S. Tsirkin
@ 2018-12-15 19:43     ` David Miller
  2018-12-16 19:57       ` Michael S. Tsirkin
  1 sibling, 1 reply; 39+ messages in thread
From: David Miller @ 2018-12-15 19:43 UTC (permalink / raw)
  To: jasowang; +Cc: mst, kvm, virtualization, netdev, linux-kernel

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 14 Dec 2018 12:29:54 +0800

> 
> On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>> Hi:
>>>
>>> This series tries to access virtqueue metadata through kernel virtual
>>> address instead of copy_user() friends since they had too much
>>> overheads like checks, spec barriers or even hardware feature
>>> toggling.
>>>
>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>> cases as well.
>>>
>>> Please review
>> I think the idea of speeding up userspace access is a good one.
>> However I think that moving all checks to start is way too aggressive.
> 
> 
> So did packet and AF_XDP. Anyway, sharing address space and access
> them directly is the fastest way. Performance is the major
> consideration for people to choose backend. Compare to userspace
> implementation, vhost does not have security advantages at any
> level. If vhost is still slow, people will start to develop backends
> based on e.g AF_XDP.

Exactly, this is precisely how this kind of problem should be solved.

Michael, I strongly support the approach Jason is taking here, and I
would like to ask you to seriously reconsider your objections.

Thank you.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-14  3:57     ` Jason Wang
  2018-12-14 12:36       ` Michael S. Tsirkin
@ 2018-12-15 21:15       ` David Miller
  1 sibling, 0 replies; 39+ messages in thread
From: David Miller @ 2018-12-15 21:15 UTC (permalink / raw)
  To: jasowang; +Cc: mst, kvm, virtualization, netdev, linux-kernel

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 14 Dec 2018 11:57:35 +0800

> This is the price of all GUP users not only vhost itself. What's more
> important, the goal is not to be left too much behind for other
> backends like DPDK or AF_XDP (all of which are using GUP).

+1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-15 19:43     ` David Miller
@ 2018-12-16 19:57       ` Michael S. Tsirkin
  2018-12-24  8:44         ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-16 19:57 UTC (permalink / raw)
  To: David Miller; +Cc: jasowang, kvm, virtualization, netdev, linux-kernel

On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Fri, 14 Dec 2018 12:29:54 +0800
> 
> > 
> > On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> >> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> >>> Hi:
> >>>
> >>> This series tries to access virtqueue metadata through kernel virtual
> >>> address instead of copy_user() friends since they had too much
> >>> overheads like checks, spec barriers or even hardware feature
> >>> toggling.
> >>>
> >>> Test shows about 24% improvement on TX PPS. It should benefit other
> >>> cases as well.
> >>>
> >>> Please review
> >> I think the idea of speeding up userspace access is a good one.
> >> However I think that moving all checks to start is way too aggressive.
> > 
> > 
> > So did packet and AF_XDP. Anyway, sharing address space and access
> > them directly is the fastest way. Performance is the major
> > consideration for people to choose backend. Compare to userspace
> > implementation, vhost does not have security advantages at any
> > level. If vhost is still slow, people will start to develop backends
> > based on e.g AF_XDP.
> 
> Exactly, this is precisely how this kind of problem should be solved.
> 
> Michael, I strongly support the approach Jason is taking here, and I
> would like to ask you to seriously reconsider your objections.
> 
> Thank you.

Okay. Won't be the first time I'm wrong.

Let's say we ignore security aspects, but we need to make sure the
following all keep working (broken with this revision):
- file backed memory (I didn't see where we mark memory dirty -
  if we don't we get guest memory corruption on close, if we do
  then host crash as https://lwn.net/Articles/774411/ seems to apply here?)
- THP
- auto-NUMA

Because vhost isn't like AF_XDP where you can just tell people "use
hugetlbfs" and "data is removed on close" - people are using it in lots
of configurations with guest memory shared between rings and unrelated
data.

Jason, thoughts on these?

-- 
MST

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-14 12:36       ` Michael S. Tsirkin
@ 2018-12-24  7:53         ` Jason Wang
  2018-12-24 18:10           ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-24  7:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>> It was noticed that the copy_user() friends that was used to access
>>>> virtqueue metdata tends to be very expensive for dataplane
>>>> implementation like vhost since it involves lots of software check,
>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>> extra cost will be more obvious when transferring small packets.
>>>>
>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>> those mappings and memory accessors are modified to use pointers to
>>>> access the metadata directly.
>>>>
>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>> use similar method to optimize it in the future.
>>>>
>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>
>>>> Before: ~5.0Mpps
>>>> After:  ~6.1Mpps
>>>>
>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>> ---
>>>>    drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>    drivers/vhost/vhost.h |  11 +++
>>>>    2 files changed, 189 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index bafe39d2e637..1bd24203afb6 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>    		vq->indirect = NULL;
>>>>    		vq->heads = NULL;
>>>>    		vq->dev = dev;
>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>    		mutex_init(&vq->mutex);
>>>>    		vhost_vq_reset(dev, vq);
>>>>    		if (vq->handle_kick)
>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>    	spin_unlock(&dev->iotlb_lock);
>>>>    }
>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>> +			   size_t size, int write)
>>>> +{
>>>> +	struct page **pages;
>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>> +	int npinned;
>>>> +	void *vaddr;
>>>> +
>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>> +	if (!pages)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>> +	if (npinned != npages)
>>>> +		goto err;
>>>> +
>>> As I said I have doubts about the whole approach, but this
>>> implementation in particular isn't a good idea
>>> as it keeps the page around forever.


The pages wil be released during set features.


>>> So no THP, no NUMA rebalancing,


For THP, we will probably miss 2 or 4 pages, but does this really matter 
consider the gain we have? For NUMA rebalancing, I'm even not quite sure 
if it can helps for the case of IPC (vhost). It looks to me the worst 
case it may cause page to be thrash between nodes if vhost and userspace 
are running in two nodes.


>>
>> This is the price of all GUP users not only vhost itself.
> Yes. GUP is just not a great interface for vhost to use.


Zerocopy codes (enabled by defualt) use them for years.


>
>> What's more
>> important, the goal is not to be left too much behind for other backends
>> like DPDK or AF_XDP (all of which are using GUP).
>
> So these guys assume userspace knows what it's doing.
> We can't assume that.


What kind of assumption do you they have?


>
>>> userspace-controlled
>>> amount of memory locked up and not accounted for.
>>
>> It's pretty easy to add this since the slow path was still kept. If we
>> exceeds the limitation, we can switch back to slow path.
>>
>>> Don't get me wrong it's a great patch in an ideal world.
>>> But then in an ideal world no barriers smap etc are necessary at all.
>>
>> Again, this is only for metadata accessing not the data which has been used
>> for years for real use cases.
>>
>> For SMAP, it makes senses for the address that kernel can not forcast. But
>> it's not the case for the vhost metadata since we know the address will be
>> accessed very frequently. For speculation barrier, it helps nothing for the
>> data path of vhost which is a kthread.
> I don't see how a kthread makes any difference. We do have a validation
> step which makes some difference.


The problem is not kthread but the address of userspace address. The 
addresses of vq metadata tends to be consistent for a while, and vhost 
knows they will be frequently. SMAP doesn't help too much in this case.

Thanks.


>
>> Packet or AF_XDP benefit from
>> accessing metadata directly, we should do it as well.
>>
>> Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-14 12:33     ` Michael S. Tsirkin
  2018-12-14 15:31       ` Michael S. Tsirkin
@ 2018-12-24  8:32       ` Jason Wang
  2018-12-24 18:12         ` Michael S. Tsirkin
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-24  8:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends since they had too much
>>>> overheads like checks, spec barriers or even hardware feature
>>>> toggling.
>>> Userspace accesses through remapping tricks and next time there's a need
>>> for a new barrier we are left to figure it out by ourselves.
>>
>> I don't get here, do you mean spec barriers?
> I mean the next barrier people decide to put into userspace
> memory accesses.
>
>> It's completely unnecessary for
>> vhost which is kernel thread.
> It's defence in depth. Take a look at the commit that added them.
> And yes quite possibly in most cases we actually have a spec
> barrier in the validation phase. If we do let's use the
> unsafe variants so they can be found.


unsafe variants can only work if you can batch userspace access. This is 
not necessarily the case for light load.


>
>> And even if you're right, vhost is not the
>> only place, there's lots of vmap() based accessing in kernel.
> For sure. But if one can get by without get user pages, one
> really should. Witness recently uncovered mess with file
> backed storage.


We only pin metadata pages, I don't believe they will be used by any DMA.


>
>> Think in
>> another direction, this means we won't suffer form unnecessary barriers for
>> kthread like vhost in the future, we will manually pick the one we really
>> need
> I personally think we should err on the side of caution not on the side of
> performance.


So what you suggest may lead unnecessary performance regression 
(10%-20%) which is part of the goal of this series. We should audit and 
only use the one we really need instead of depending on copy_user() 
friends().

If we do it our own, it could be slow for for security fix but it's no 
less safe than before with performance kept.


>
>> (but it should have little possibility).
> History seems to teach otherwise.


What case did you mean here?


>
>> Please notice we only access metdata through remapping not the data itself.
>> This idea has been used for high speed userspace backend for years, e.g
>> packet socket or recent AF_XDP.
> I think their justification for the higher risk is that they are mostly
> designed for priveledged userspace.


I think it's the same with TUN/TAP, privileged process can pass them to 
unprivileged ones.


>
>> The only difference is the page was remap to
>> from kernel to userspace.
> At least that avoids the g.u.p mess.


I'm still not very clear at the point. We only pin 2 or 4 pages, they're 
several other cases that will pin much more.


>
>>>     I don't
>>> like the idea I have to say.  As a first step, why don't we switch to
>>> unsafe_put_user/unsafe_get_user etc?
>>
>> Several reasons:
>>
>> - They only have x86 variant, it won't have any difference for the rest of
>> architecture.
> Is there an issue on other architectures? If yes they can be extended
> there.


Consider the unexpected amount of work and in the best case it can give 
the same performance to vmap(). I'm not sure it's worth.


>
>> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
>> (e.g accessing descriptor) or arrays (batching).
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
>
>> - Unless we can batch at least the accessing of two places in three of
>> avail, used and descriptor in one run. There will be no difference. E.g we
>> can batch updating used ring, but it won't make any difference in this case.
>>
> So let's batch them all?


Batching might not help for the case of light load. And we need to 
measure the gain/cost of batching itself.


>
>
>>> That would be more of an apples to apples comparison, would it not?
>>
>> Apples to apples comparison only help if we are the No.1. But the fact is we
>> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
>> fastest method AFAIK.
>>
>>
>> Thanks
> We need to speed up the packet access itself too though.
> You can't vmap all of guest memory.


This series only pin and vmap very few pages (metadata).

Thanks


>
>
>>>
>>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>>> cases as well.
>>>>
>>>> Please review
>>>>
>>>> Jason Wang (3):
>>>>     vhost: generalize adding used elem
>>>>     vhost: fine grain userspace memory accessors
>>>>     vhost: access vq metadata through kernel virtual address
>>>>
>>>>    drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>>>    drivers/vhost/vhost.h |  11 ++
>>>>    2 files changed, 266 insertions(+), 26 deletions(-)
>>>>
>>>> -- 
>>>> 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-16 19:57       ` Michael S. Tsirkin
@ 2018-12-24  8:44         ` Jason Wang
  2018-12-24 19:09           ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-24  8:44 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Miller
  Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/17 上午3:57, Michael S. Tsirkin wrote:
> On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Fri, 14 Dec 2018 12:29:54 +0800
>>
>>> On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>> Hi:
>>>>>
>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>> address instead of copy_user() friends since they had too much
>>>>> overheads like checks, spec barriers or even hardware feature
>>>>> toggling.
>>>>>
>>>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>>>> cases as well.
>>>>>
>>>>> Please review
>>>> I think the idea of speeding up userspace access is a good one.
>>>> However I think that moving all checks to start is way too aggressive.
>>>
>>> So did packet and AF_XDP. Anyway, sharing address space and access
>>> them directly is the fastest way. Performance is the major
>>> consideration for people to choose backend. Compare to userspace
>>> implementation, vhost does not have security advantages at any
>>> level. If vhost is still slow, people will start to develop backends
>>> based on e.g AF_XDP.
>> Exactly, this is precisely how this kind of problem should be solved.
>>
>> Michael, I strongly support the approach Jason is taking here, and I
>> would like to ask you to seriously reconsider your objections.
>>
>> Thank you.
> Okay. Won't be the first time I'm wrong.
>
> Let's say we ignore security aspects, but we need to make sure the
> following all keep working (broken with this revision):
> - file backed memory (I didn't see where we mark memory dirty -
>    if we don't we get guest memory corruption on close, if we do
>    then host crash as https://lwn.net/Articles/774411/ seems to apply here?)


We only pin metadata pages, so I don't think they can be used for DMA. 
So it was probably not an issue. The real issue is zerocopy codes, maybe 
it's time to disable it by default?


> - THP


We will miss 2 or 4 pages for THP, I wonder whether or not it's measurable.


> - auto-NUMA


I'm not sure auto-NUMA will help for the case of IPC. It can damage the 
performance in the worst case if vhost and userspace are running in two 
different nodes. Anyway I can measure.


>
> Because vhost isn't like AF_XDP where you can just tell people "use
> hugetlbfs" and "data is removed on close" - people are using it in lots
> of configurations with guest memory shared between rings and unrelated
> data.


This series doesn't share data, only metadata is shared.


>
> Jason, thoughts on these?
>

Based on the above, I can measure the impact of THP to see how it impacts.

For unsafe variants, it can only work for when we can batch the access 
and it needs non trivial rework on the vhost codes with unexpected 
amount of work for archs other than x86. I'm not sure it's worth to try.

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-24  7:53         ` Jason Wang
@ 2018-12-24 18:10           ` Michael S. Tsirkin
  2018-12-25 10:05             ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-24 18:10 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> 
> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > It was noticed that the copy_user() friends that was used to access
> > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > implementation like vhost since it involves lots of software check,
> > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > extra cost will be more obvious when transferring small packets.
> > > > > 
> > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > those mappings and memory accessors are modified to use pointers to
> > > > > access the metadata directly.
> > > > > 
> > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > use similar method to optimize it in the future.
> > > > > 
> > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > 
> > > > > Before: ~5.0Mpps
> > > > > After:  ~6.1Mpps
> > > > > 
> > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > ---
> > > > >    drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > >    drivers/vhost/vhost.h |  11 +++
> > > > >    2 files changed, 189 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > >    		vq->indirect = NULL;
> > > > >    		vq->heads = NULL;
> > > > >    		vq->dev = dev;
> > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > >    		mutex_init(&vq->mutex);
> > > > >    		vhost_vq_reset(dev, vq);
> > > > >    		if (vq->handle_kick)
> > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > >    	spin_unlock(&dev->iotlb_lock);
> > > > >    }
> > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > +			   size_t size, int write)
> > > > > +{
> > > > > +	struct page **pages;
> > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > +	int npinned;
> > > > > +	void *vaddr;
> > > > > +
> > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > +	if (!pages)
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > +	if (npinned != npages)
> > > > > +		goto err;
> > > > > +
> > > > As I said I have doubts about the whole approach, but this
> > > > implementation in particular isn't a good idea
> > > > as it keeps the page around forever.
> 
> 
> The pages wil be released during set features.
> 
> 
> > > > So no THP, no NUMA rebalancing,
> 
> 
> For THP, we will probably miss 2 or 4 pages, but does this really matter
> consider the gain we have?

We as in vhost? networking isn't the only thing guest does.
We don't even know if this guest does a lot of networking.
You don't
know what else is in this huge page. Can be something very important
that guest touches all the time.

> For NUMA rebalancing, I'm even not quite sure if
> it can helps for the case of IPC (vhost). It looks to me the worst case it
> may cause page to be thrash between nodes if vhost and userspace are running
> in two nodes.


So again it's a gain for vhost but has a completely unpredictable effect on
other functionality of the guest.

That's what bothers me with this approach.




> 
> > > 
> > > This is the price of all GUP users not only vhost itself.
> > Yes. GUP is just not a great interface for vhost to use.
> 
> 
> Zerocopy codes (enabled by defualt) use them for years.

But only for TX and temporarily. We pin, read, unpin.

Your patch is different

- it writes into memory and GUP has known issues with file
  backed memory
- it keeps pages pinned forever



> 
> > 
> > > What's more
> > > important, the goal is not to be left too much behind for other backends
> > > like DPDK or AF_XDP (all of which are using GUP).
> > 
> > So these guys assume userspace knows what it's doing.
> > We can't assume that.
> 
> 
> What kind of assumption do you they have?
> 
> 
> > 
> > > > userspace-controlled
> > > > amount of memory locked up and not accounted for.
> > > 
> > > It's pretty easy to add this since the slow path was still kept. If we
> > > exceeds the limitation, we can switch back to slow path.
> > > 
> > > > Don't get me wrong it's a great patch in an ideal world.
> > > > But then in an ideal world no barriers smap etc are necessary at all.
> > > 
> > > Again, this is only for metadata accessing not the data which has been used
> > > for years for real use cases.
> > > 
> > > For SMAP, it makes senses for the address that kernel can not forcast. But
> > > it's not the case for the vhost metadata since we know the address will be
> > > accessed very frequently. For speculation barrier, it helps nothing for the
> > > data path of vhost which is a kthread.
> > I don't see how a kthread makes any difference. We do have a validation
> > step which makes some difference.
> 
> 
> The problem is not kthread but the address of userspace address. The
> addresses of vq metadata tends to be consistent for a while, and vhost knows
> they will be frequently. SMAP doesn't help too much in this case.
> 
> Thanks.

It's true for a real life applications but a malicious one
can call the setup ioctls any number of times. And SMAP is
all about malcious applications.

> 
> > 
> > > Packet or AF_XDP benefit from
> > > accessing metadata directly, we should do it as well.
> > > 
> > > Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-24  8:32       ` Jason Wang
@ 2018-12-24 18:12         ` Michael S. Tsirkin
  2018-12-25 10:09           ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-24 18:12 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
> 
> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> > On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> > > On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > Hi:
> > > > > 
> > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > address instead of copy_user() friends since they had too much
> > > > > overheads like checks, spec barriers or even hardware feature
> > > > > toggling.
> > > > Userspace accesses through remapping tricks and next time there's a need
> > > > for a new barrier we are left to figure it out by ourselves.
> > > 
> > > I don't get here, do you mean spec barriers?
> > I mean the next barrier people decide to put into userspace
> > memory accesses.
> > 
> > > It's completely unnecessary for
> > > vhost which is kernel thread.
> > It's defence in depth. Take a look at the commit that added them.
> > And yes quite possibly in most cases we actually have a spec
> > barrier in the validation phase. If we do let's use the
> > unsafe variants so they can be found.
> 
> 
> unsafe variants can only work if you can batch userspace access. This is not
> necessarily the case for light load.


Do we care a lot about the light load? How would you benchmark it?


> 
> > 
> > > And even if you're right, vhost is not the
> > > only place, there's lots of vmap() based accessing in kernel.
> > For sure. But if one can get by without get user pages, one
> > really should. Witness recently uncovered mess with file
> > backed storage.
> 
> 
> We only pin metadata pages, I don't believe they will be used by any DMA.

It doesn't matter really, if you dirty pages behind the MM back
the problem is there.

> 
> > 
> > > Think in
> > > another direction, this means we won't suffer form unnecessary barriers for
> > > kthread like vhost in the future, we will manually pick the one we really
> > > need
> > I personally think we should err on the side of caution not on the side of
> > performance.
> 
> 
> So what you suggest may lead unnecessary performance regression (10%-20%)
> which is part of the goal of this series. We should audit and only use the
> one we really need instead of depending on copy_user() friends().
> 
> If we do it our own, it could be slow for for security fix but it's no less
> safe than before with performance kept.
> 
> 
> > 
> > > (but it should have little possibility).
> > History seems to teach otherwise.
> 
> 
> What case did you mean here?
> 
> 
> > 
> > > Please notice we only access metdata through remapping not the data itself.
> > > This idea has been used for high speed userspace backend for years, e.g
> > > packet socket or recent AF_XDP.
> > I think their justification for the higher risk is that they are mostly
> > designed for priveledged userspace.
> 
> 
> I think it's the same with TUN/TAP, privileged process can pass them to
> unprivileged ones.
> 
> 
> > 
> > > The only difference is the page was remap to
> > > from kernel to userspace.
> > At least that avoids the g.u.p mess.
> 
> 
> I'm still not very clear at the point. We only pin 2 or 4 pages, they're
> several other cases that will pin much more.
> 
> 
> > 
> > > >     I don't
> > > > like the idea I have to say.  As a first step, why don't we switch to
> > > > unsafe_put_user/unsafe_get_user etc?
> > > 
> > > Several reasons:
> > > 
> > > - They only have x86 variant, it won't have any difference for the rest of
> > > architecture.
> > Is there an issue on other architectures? If yes they can be extended
> > there.
> 
> 
> Consider the unexpected amount of work and in the best case it can give the
> same performance to vmap(). I'm not sure it's worth.
> 
> 
> > 
> > > - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> > > (e.g accessing descriptor) or arrays (batching).
> > So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
> > 
> > > - Unless we can batch at least the accessing of two places in three of
> > > avail, used and descriptor in one run. There will be no difference. E.g we
> > > can batch updating used ring, but it won't make any difference in this case.
> > > 
> > So let's batch them all?
> 
> 
> Batching might not help for the case of light load. And we need to measure
> the gain/cost of batching itself.
> 
> 
> > 
> > 
> > > > That would be more of an apples to apples comparison, would it not?
> > > 
> > > Apples to apples comparison only help if we are the No.1. But the fact is we
> > > are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
> > > fastest method AFAIK.
> > > 
> > > 
> > > Thanks
> > We need to speed up the packet access itself too though.
> > You can't vmap all of guest memory.
> 
> 
> This series only pin and vmap very few pages (metadata).
> 
> Thanks
> 
> 
> > 
> > 
> > > > 
> > > > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > > > cases as well.
> > > > > 
> > > > > Please review
> > > > > 
> > > > > Jason Wang (3):
> > > > >     vhost: generalize adding used elem
> > > > >     vhost: fine grain userspace memory accessors
> > > > >     vhost: access vq metadata through kernel virtual address
> > > > > 
> > > > >    drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > > > >    drivers/vhost/vhost.h |  11 ++
> > > > >    2 files changed, 266 insertions(+), 26 deletions(-)
> > > > > 
> > > > > -- 
> > > > > 2.17.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-24  8:44         ` Jason Wang
@ 2018-12-24 19:09           ` Michael S. Tsirkin
  0 siblings, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-24 19:09 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Miller, kvm, virtualization, netdev, linux-kernel

On Mon, Dec 24, 2018 at 04:44:14PM +0800, Jason Wang wrote:
> 
> On 2018/12/17 上午3:57, Michael S. Tsirkin wrote:
> > On Sat, Dec 15, 2018 at 11:43:08AM -0800, David Miller wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Date: Fri, 14 Dec 2018 12:29:54 +0800
> > > 
> > > > On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > > Hi:
> > > > > > 
> > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > address instead of copy_user() friends since they had too much
> > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > toggling.
> > > > > > 
> > > > > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > > > > cases as well.
> > > > > > 
> > > > > > Please review
> > > > > I think the idea of speeding up userspace access is a good one.
> > > > > However I think that moving all checks to start is way too aggressive.
> > > > 
> > > > So did packet and AF_XDP. Anyway, sharing address space and access
> > > > them directly is the fastest way. Performance is the major
> > > > consideration for people to choose backend. Compare to userspace
> > > > implementation, vhost does not have security advantages at any
> > > > level. If vhost is still slow, people will start to develop backends
> > > > based on e.g AF_XDP.
> > > Exactly, this is precisely how this kind of problem should be solved.
> > > 
> > > Michael, I strongly support the approach Jason is taking here, and I
> > > would like to ask you to seriously reconsider your objections.
> > > 
> > > Thank you.
> > Okay. Won't be the first time I'm wrong.
> > 
> > Let's say we ignore security aspects, but we need to make sure the
> > following all keep working (broken with this revision):
> > - file backed memory (I didn't see where we mark memory dirty -
> >    if we don't we get guest memory corruption on close, if we do
> >    then host crash as https://lwn.net/Articles/774411/ seems to apply here?)
> 
> 
> We only pin metadata pages, so I don't think they can be used for DMA. So it
> was probably not an issue. The real issue is zerocopy codes, maybe it's time
> to disable it by default?
> 
> 
> > - THP
> 
> 
> We will miss 2 or 4 pages for THP, I wonder whether or not it's measurable.
> 
> 
> > - auto-NUMA
> 
> 
> I'm not sure auto-NUMA will help for the case of IPC. It can damage the
> performance in the worst case if vhost and userspace are running in two
> different nodes. Anyway I can measure.
> 
> 
> > 
> > Because vhost isn't like AF_XDP where you can just tell people "use
> > hugetlbfs" and "data is removed on close" - people are using it in lots
> > of configurations with guest memory shared between rings and unrelated
> > data.
> 
> 
> This series doesn't share data, only metadata is shared.

Let me clarify - I mean that metadata is in same huge page with
unrelated guest data. 

> 
> > 
> > Jason, thoughts on these?
> > 
> 
> Based on the above, I can measure the impact of THP to see how it impacts.
> 
> For unsafe variants, it can only work for when we can batch the access and
> it needs non trivial rework on the vhost codes with unexpected amount of
> work for archs other than x86. I'm not sure it's worth to try.
> 
> Thanks

Yes I think we need better APIs in vhost. Right now
we have an API to get and translate a single buffer.
We should have one that gets a batch of descriptors
and stores it, then one that translates this batch.

IMHO this will benefit everyone even if we do vmap due to
better code locality.

-- 
MST

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-24 18:10           ` Michael S. Tsirkin
@ 2018-12-25 10:05             ` Jason Wang
  2018-12-25 12:50               ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-25 10:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>> implementation like vhost since it involves lots of software check,
>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>
>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>> access the metadata directly.
>>>>>>
>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>> use similar method to optimize it in the future.
>>>>>>
>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>
>>>>>> Before: ~5.0Mpps
>>>>>> After:  ~6.1Mpps
>>>>>>
>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>> ---
>>>>>>     drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     drivers/vhost/vhost.h |  11 +++
>>>>>>     2 files changed, 189 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>> --- a/drivers/vhost/vhost.c
>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>     		vq->indirect = NULL;
>>>>>>     		vq->heads = NULL;
>>>>>>     		vq->dev = dev;
>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>     		mutex_init(&vq->mutex);
>>>>>>     		vhost_vq_reset(dev, vq);
>>>>>>     		if (vq->handle_kick)
>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>     	spin_unlock(&dev->iotlb_lock);
>>>>>>     }
>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>> +			   size_t size, int write)
>>>>>> +{
>>>>>> +	struct page **pages;
>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>> +	int npinned;
>>>>>> +	void *vaddr;
>>>>>> +
>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>> +	if (!pages)
>>>>>> +		return -ENOMEM;
>>>>>> +
>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>> +	if (npinned != npages)
>>>>>> +		goto err;
>>>>>> +
>>>>> As I said I have doubts about the whole approach, but this
>>>>> implementation in particular isn't a good idea
>>>>> as it keeps the page around forever.
>>
>> The pages wil be released during set features.
>>
>>
>>>>> So no THP, no NUMA rebalancing,
>>
>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>> consider the gain we have?
> We as in vhost? networking isn't the only thing guest does.
> We don't even know if this guest does a lot of networking.
> You don't
> know what else is in this huge page. Can be something very important
> that guest touches all the time.


Well, the probability should be very small consider we usually give 
several gigabytes to guest. The rest of the pages that doesn't sit in 
the same hugepage with metadata can still be merged by THP.  Anyway, I 
can test the differences.


>
>> For NUMA rebalancing, I'm even not quite sure if
>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>> may cause page to be thrash between nodes if vhost and userspace are running
>> in two nodes.
>
> So again it's a gain for vhost but has a completely unpredictable effect on
> other functionality of the guest.
>
> That's what bothers me with this approach.


So:

- The rest of the pages could still be balanced to other nodes, no?

- try to balance metadata pages (belongs to co-operate processes) itself 
is still questionable


>
>
>
>
>>>> This is the price of all GUP users not only vhost itself.
>>> Yes. GUP is just not a great interface for vhost to use.
>>
>> Zerocopy codes (enabled by defualt) use them for years.
> But only for TX and temporarily. We pin, read, unpin.


Probably not. For several reasons that the page will be not be released 
soon or held for a very long period of time or even forever.


>
> Your patch is different
>
> - it writes into memory and GUP has known issues with file
>    backed memory


The ordinary user for vhost is anonymous pages I think?


> - it keeps pages pinned forever
>
>
>
>>>> What's more
>>>> important, the goal is not to be left too much behind for other backends
>>>> like DPDK or AF_XDP (all of which are using GUP).
>>> So these guys assume userspace knows what it's doing.
>>> We can't assume that.
>>
>> What kind of assumption do you they have?
>>
>>
>>>>> userspace-controlled
>>>>> amount of memory locked up and not accounted for.
>>>> It's pretty easy to add this since the slow path was still kept. If we
>>>> exceeds the limitation, we can switch back to slow path.
>>>>
>>>>> Don't get me wrong it's a great patch in an ideal world.
>>>>> But then in an ideal world no barriers smap etc are necessary at all.
>>>> Again, this is only for metadata accessing not the data which has been used
>>>> for years for real use cases.
>>>>
>>>> For SMAP, it makes senses for the address that kernel can not forcast. But
>>>> it's not the case for the vhost metadata since we know the address will be
>>>> accessed very frequently. For speculation barrier, it helps nothing for the
>>>> data path of vhost which is a kthread.
>>> I don't see how a kthread makes any difference. We do have a validation
>>> step which makes some difference.
>>
>> The problem is not kthread but the address of userspace address. The
>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>> they will be frequently. SMAP doesn't help too much in this case.
>>
>> Thanks.
> It's true for a real life applications but a malicious one
> can call the setup ioctls any number of times. And SMAP is
> all about malcious applications.


We don't do this in the path of ioctl, there's no context switch between 
userspace and kernel in the worker thread. SMAP is used to prevent 
kernel from accessing userspace pages unexpectedly which is not the case 
for metadata access.

Thanks


>
>>>> Packet or AF_XDP benefit from
>>>> accessing metadata directly, we should do it as well.
>>>>
>>>> Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-24 18:12         ` Michael S. Tsirkin
@ 2018-12-25 10:09           ` Jason Wang
  2018-12-25 12:52             ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-25 10:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
> On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
>>> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>>>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>> address instead of copy_user() friends since they had too much
>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>> toggling.
>>>>> Userspace accesses through remapping tricks and next time there's a need
>>>>> for a new barrier we are left to figure it out by ourselves.
>>>> I don't get here, do you mean spec barriers?
>>> I mean the next barrier people decide to put into userspace
>>> memory accesses.
>>>
>>>> It's completely unnecessary for
>>>> vhost which is kernel thread.
>>> It's defence in depth. Take a look at the commit that added them.
>>> And yes quite possibly in most cases we actually have a spec
>>> barrier in the validation phase. If we do let's use the
>>> unsafe variants so they can be found.
>>
>> unsafe variants can only work if you can batch userspace access. This is not
>> necessarily the case for light load.
>
> Do we care a lot about the light load? How would you benchmark it?
>

If we can reduce the latency that's will be more than what we expect.

1 byte TCP_RR shows 1.5%-2% improvement.


>>>> And even if you're right, vhost is not the
>>>> only place, there's lots of vmap() based accessing in kernel.
>>> For sure. But if one can get by without get user pages, one
>>> really should. Witness recently uncovered mess with file
>>> backed storage.
>>
>> We only pin metadata pages, I don't believe they will be used by any DMA.
> It doesn't matter really, if you dirty pages behind the MM back
> the problem is there.


Ok, but the usual case is anonymous pages, do we use file backed pages 
for user of vhost? And even if we use sometime, according to the pointer 
it's not something that can fix, RFC has been posted to solve this issue.

Thanks



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-25 10:05             ` Jason Wang
@ 2018-12-25 12:50               ` Michael S. Tsirkin
  2018-12-26  3:57                 ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-25 12:50 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> 
> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > 
> > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > access the metadata directly.
> > > > > > > 
> > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > use similar method to optimize it in the future.
> > > > > > > 
> > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > 
> > > > > > > Before: ~5.0Mpps
> > > > > > > After:  ~6.1Mpps
> > > > > > > 
> > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > ---
> > > > > > >     drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > >     drivers/vhost/vhost.h |  11 +++
> > > > > > >     2 files changed, 189 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > >     		vq->indirect = NULL;
> > > > > > >     		vq->heads = NULL;
> > > > > > >     		vq->dev = dev;
> > > > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > >     		mutex_init(&vq->mutex);
> > > > > > >     		vhost_vq_reset(dev, vq);
> > > > > > >     		if (vq->handle_kick)
> > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > >     	spin_unlock(&dev->iotlb_lock);
> > > > > > >     }
> > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > +			   size_t size, int write)
> > > > > > > +{
> > > > > > > +	struct page **pages;
> > > > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > +	int npinned;
> > > > > > > +	void *vaddr;
> > > > > > > +
> > > > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > +	if (!pages)
> > > > > > > +		return -ENOMEM;
> > > > > > > +
> > > > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > +	if (npinned != npages)
> > > > > > > +		goto err;
> > > > > > > +
> > > > > > As I said I have doubts about the whole approach, but this
> > > > > > implementation in particular isn't a good idea
> > > > > > as it keeps the page around forever.
> > > 
> > > The pages wil be released during set features.
> > > 
> > > 
> > > > > > So no THP, no NUMA rebalancing,
> > > 
> > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > consider the gain we have?
> > We as in vhost? networking isn't the only thing guest does.
> > We don't even know if this guest does a lot of networking.
> > You don't
> > know what else is in this huge page. Can be something very important
> > that guest touches all the time.
> 
> 
> Well, the probability should be very small consider we usually give several
> gigabytes to guest. The rest of the pages that doesn't sit in the same
> hugepage with metadata can still be merged by THP.  Anyway, I can test the
> differences.

Thanks!

> 
> > 
> > > For NUMA rebalancing, I'm even not quite sure if
> > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > may cause page to be thrash between nodes if vhost and userspace are running
> > > in two nodes.
> > 
> > So again it's a gain for vhost but has a completely unpredictable effect on
> > other functionality of the guest.
> > 
> > That's what bothers me with this approach.
> 
> 
> So:
> 
> - The rest of the pages could still be balanced to other nodes, no?
> 
> - try to balance metadata pages (belongs to co-operate processes) itself is
> still questionable

I am not sure why. It should be easy enough to force the VCPU and vhost
to move (e.g. start them pinned to 1 cpu, then pin them to another one).
Clearly sometimes this would be necessary for load balancing reasons.
With autonuma after a while (could take seconds but it will happen) the
memory will migrate.




> 
> > 
> > 
> > 
> > 
> > > > > This is the price of all GUP users not only vhost itself.
> > > > Yes. GUP is just not a great interface for vhost to use.
> > > 
> > > Zerocopy codes (enabled by defualt) use them for years.
> > But only for TX and temporarily. We pin, read, unpin.
> 
> 
> Probably not. For several reasons that the page will be not be released soon
> or held for a very long period of time or even forever.


With zero copy? Well it's pinned until transmit. Takes a while
but could be enough for autocopy to work esp since
its the packet memory so not reused immediately.

> 
> > 
> > Your patch is different
> > 
> > - it writes into memory and GUP has known issues with file
> >    backed memory
> 
> 
> The ordinary user for vhost is anonymous pages I think?


It's not the most common scenario and not the fastest one
(e.g. THP does not work) but file backed is useful sometimes.
It would not be nice at all to corrupt guest memory in that case.


> 
> > - it keeps pages pinned forever
> > 
> > 
> > 
> > > > > What's more
> > > > > important, the goal is not to be left too much behind for other backends
> > > > > like DPDK or AF_XDP (all of which are using GUP).
> > > > So these guys assume userspace knows what it's doing.
> > > > We can't assume that.
> > > 
> > > What kind of assumption do you they have?
> > > 
> > > 
> > > > > > userspace-controlled
> > > > > > amount of memory locked up and not accounted for.
> > > > > It's pretty easy to add this since the slow path was still kept. If we
> > > > > exceeds the limitation, we can switch back to slow path.
> > > > > 
> > > > > > Don't get me wrong it's a great patch in an ideal world.
> > > > > > But then in an ideal world no barriers smap etc are necessary at all.
> > > > > Again, this is only for metadata accessing not the data which has been used
> > > > > for years for real use cases.
> > > > > 
> > > > > For SMAP, it makes senses for the address that kernel can not forcast. But
> > > > > it's not the case for the vhost metadata since we know the address will be
> > > > > accessed very frequently. For speculation barrier, it helps nothing for the
> > > > > data path of vhost which is a kthread.
> > > > I don't see how a kthread makes any difference. We do have a validation
> > > > step which makes some difference.
> > > 
> > > The problem is not kthread but the address of userspace address. The
> > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > they will be frequently. SMAP doesn't help too much in this case.
> > > 
> > > Thanks.
> > It's true for a real life applications but a malicious one
> > can call the setup ioctls any number of times. And SMAP is
> > all about malcious applications.
> 
> 
> We don't do this in the path of ioctl, there's no context switch between
> userspace and kernel in the worker thread. SMAP is used to prevent kernel
> from accessing userspace pages unexpectedly which is not the case for
> metadata access.
> 
> Thanks

OK let's forget smap for now.

> 
> > 
> > > > > Packet or AF_XDP benefit from
> > > > > accessing metadata directly, we should do it as well.
> > > > > 
> > > > > Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-25 10:09           ` Jason Wang
@ 2018-12-25 12:52             ` Michael S. Tsirkin
  2018-12-26  3:59               ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-25 12:52 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Tue, Dec 25, 2018 at 06:09:19PM +0800, Jason Wang wrote:
> 
> On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
> > On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
> > > On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
> > > > On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> > > > > On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > > > > > Hi:
> > > > > > > 
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Userspace accesses through remapping tricks and next time there's a need
> > > > > > for a new barrier we are left to figure it out by ourselves.
> > > > > I don't get here, do you mean spec barriers?
> > > > I mean the next barrier people decide to put into userspace
> > > > memory accesses.
> > > > 
> > > > > It's completely unnecessary for
> > > > > vhost which is kernel thread.
> > > > It's defence in depth. Take a look at the commit that added them.
> > > > And yes quite possibly in most cases we actually have a spec
> > > > barrier in the validation phase. If we do let's use the
> > > > unsafe variants so they can be found.
> > > 
> > > unsafe variants can only work if you can batch userspace access. This is not
> > > necessarily the case for light load.
> > 
> > Do we care a lot about the light load? How would you benchmark it?
> > 
> 
> If we can reduce the latency that's will be more than what we expect.
> 
> 1 byte TCP_RR shows 1.5%-2% improvement.

It's nice but not great. E.g. adaptive polling would be
a better approach to work on latency imho.

> 
> > > > > And even if you're right, vhost is not the
> > > > > only place, there's lots of vmap() based accessing in kernel.
> > > > For sure. But if one can get by without get user pages, one
> > > > really should. Witness recently uncovered mess with file
> > > > backed storage.
> > > 
> > > We only pin metadata pages, I don't believe they will be used by any DMA.
> > It doesn't matter really, if you dirty pages behind the MM back
> > the problem is there.
> 
> 
> Ok, but the usual case is anonymous pages, do we use file backed pages for
> user of vhost?

Some people use file backed pages for vms.
Nothing prevents them from using vhost as well.

> And even if we use sometime, according to the pointer it's
> not something that can fix, RFC has been posted to solve this issue.
> 
> Thanks

Except it's not broken if we don't to gup + write.
So yea, wait for rfc to be merged.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-25 12:50               ` Michael S. Tsirkin
@ 2018-12-26  3:57                 ` Jason Wang
  2018-12-26 15:02                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-26  3:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>
>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>> access the metadata directly.
>>>>>>>>
>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>
>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>
>>>>>>>> Before: ~5.0Mpps
>>>>>>>> After:  ~6.1Mpps
>>>>>>>>
>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      drivers/vhost/vhost.h |  11 +++
>>>>>>>>      2 files changed, 189 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>      		vq->indirect = NULL;
>>>>>>>>      		vq->heads = NULL;
>>>>>>>>      		vq->dev = dev;
>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>      		mutex_init(&vq->mutex);
>>>>>>>>      		vhost_vq_reset(dev, vq);
>>>>>>>>      		if (vq->handle_kick)
>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>      	spin_unlock(&dev->iotlb_lock);
>>>>>>>>      }
>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>> +			   size_t size, int write)
>>>>>>>> +{
>>>>>>>> +	struct page **pages;
>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>> +	int npinned;
>>>>>>>> +	void *vaddr;
>>>>>>>> +
>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>> +	if (!pages)
>>>>>>>> +		return -ENOMEM;
>>>>>>>> +
>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>> +	if (npinned != npages)
>>>>>>>> +		goto err;
>>>>>>>> +
>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>> implementation in particular isn't a good idea
>>>>>>> as it keeps the page around forever.
>>>> The pages wil be released during set features.
>>>>
>>>>
>>>>>>> So no THP, no NUMA rebalancing,
>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>> consider the gain we have?
>>> We as in vhost? networking isn't the only thing guest does.
>>> We don't even know if this guest does a lot of networking.
>>> You don't
>>> know what else is in this huge page. Can be something very important
>>> that guest touches all the time.
>>
>> Well, the probability should be very small consider we usually give several
>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>> differences.
> Thanks!
>
>>>> For NUMA rebalancing, I'm even not quite sure if
>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>> in two nodes.
>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>> other functionality of the guest.
>>>
>>> That's what bothers me with this approach.
>>
>> So:
>>
>> - The rest of the pages could still be balanced to other nodes, no?
>>
>> - try to balance metadata pages (belongs to co-operate processes) itself is
>> still questionable
> I am not sure why. It should be easy enough to force the VCPU and vhost
> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> Clearly sometimes this would be necessary for load balancing reasons.


Yes, but it looks to me the part of motivation of auto NUMA is to avoid 
manual pinning.


> With autonuma after a while (could take seconds but it will happen) the
> memory will migrate.
>

Yes. As you mentioned during the discuss, I wonder we could do it 
similarly through mmu notifier like APIC access page in commit 
c24ae0dcd3e ("kvm: x86: Unpin and remove kvm_arch->apic_access_page")


>
>
>>>
>>>
>>>
>>>>>> This is the price of all GUP users not only vhost itself.
>>>>> Yes. GUP is just not a great interface for vhost to use.
>>>> Zerocopy codes (enabled by defualt) use them for years.
>>> But only for TX and temporarily. We pin, read, unpin.
>>
>> Probably not. For several reasons that the page will be not be released soon
>> or held for a very long period of time or even forever.
>
> With zero copy? Well it's pinned until transmit. Takes a while
> but could be enough for autocopy to work esp since
> its the packet memory so not reused immediately.
>
>>> Your patch is different
>>>
>>> - it writes into memory and GUP has known issues with file
>>>     backed memory
>>
>> The ordinary user for vhost is anonymous pages I think?
>
> It's not the most common scenario and not the fastest one
> (e.g. THP does not work) but file backed is useful sometimes.
> It would not be nice at all to corrupt guest memory in that case.


Ok.


>
>>> - it keeps pages pinned forever
>>>
>>>
>>>
>>>>>> What's more
>>>>>> important, the goal is not to be left too much behind for other backends
>>>>>> like DPDK or AF_XDP (all of which are using GUP).
>>>>> So these guys assume userspace knows what it's doing.
>>>>> We can't assume that.
>>>> What kind of assumption do you they have?
>>>>
>>>>
>>>>>>> userspace-controlled
>>>>>>> amount of memory locked up and not accounted for.
>>>>>> It's pretty easy to add this since the slow path was still kept. If we
>>>>>> exceeds the limitation, we can switch back to slow path.
>>>>>>
>>>>>>> Don't get me wrong it's a great patch in an ideal world.
>>>>>>> But then in an ideal world no barriers smap etc are necessary at all.
>>>>>> Again, this is only for metadata accessing not the data which has been used
>>>>>> for years for real use cases.
>>>>>>
>>>>>> For SMAP, it makes senses for the address that kernel can not forcast. But
>>>>>> it's not the case for the vhost metadata since we know the address will be
>>>>>> accessed very frequently. For speculation barrier, it helps nothing for the
>>>>>> data path of vhost which is a kthread.
>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>> step which makes some difference.
>>>> The problem is not kthread but the address of userspace address. The
>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>
>>>> Thanks.
>>> It's true for a real life applications but a malicious one
>>> can call the setup ioctls any number of times. And SMAP is
>>> all about malcious applications.
>>
>> We don't do this in the path of ioctl, there's no context switch between
>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>> from accessing userspace pages unexpectedly which is not the case for
>> metadata access.
>>
>> Thanks
> OK let's forget smap for now.


Some numbers I measured:

On an old Sandy bridge machine without SMAP support. Remove speculation 
barrier boost the performance from 4.6Mpps to 5.1Mpps

On a newer Broadwell machine with SMAP support. Remove speculation 
barrier only gives 2%-5% improvement, disable SMAP completely through 
Kconfig boost 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps 
- 6.1Mpps, it only bypass SMAP for metadata).

So it looks like for recent machine, SMAP becomes pain point when the 
copy is short (e.g 64B) for high PPS.

Thanks


>
>>>>>> Packet or AF_XDP benefit from
>>>>>> accessing metadata directly, we should do it as well.
>>>>>>
>>>>>> Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
  2018-12-25 12:52             ` Michael S. Tsirkin
@ 2018-12-26  3:59               ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2018-12-26  3:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/25 下午8:52, Michael S. Tsirkin wrote:
> On Tue, Dec 25, 2018 at 06:09:19PM +0800, Jason Wang wrote:
>> On 2018/12/25 上午2:12, Michael S. Tsirkin wrote:
>>> On Mon, Dec 24, 2018 at 04:32:39PM +0800, Jason Wang wrote:
>>>> On 2018/12/14 下午8:33, Michael S. Tsirkin wrote:
>>>>> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Userspace accesses through remapping tricks and next time there's a need
>>>>>>> for a new barrier we are left to figure it out by ourselves.
>>>>>> I don't get here, do you mean spec barriers?
>>>>> I mean the next barrier people decide to put into userspace
>>>>> memory accesses.
>>>>>
>>>>>> It's completely unnecessary for
>>>>>> vhost which is kernel thread.
>>>>> It's defence in depth. Take a look at the commit that added them.
>>>>> And yes quite possibly in most cases we actually have a spec
>>>>> barrier in the validation phase. If we do let's use the
>>>>> unsafe variants so they can be found.
>>>> unsafe variants can only work if you can batch userspace access. This is not
>>>> necessarily the case for light load.
>>> Do we care a lot about the light load? How would you benchmark it?
>>>
>> If we can reduce the latency that's will be more than what we expect.
>>
>> 1 byte TCP_RR shows 1.5%-2% improvement.
> It's nice but not great. E.g. adaptive polling would be
> a better approach to work on latency imho.


Actually this is another advantages of vmap():

For split ring, we will poll avail idx

For packed ring, we will poll wrap counter

Either of which can not be batched.


>
>>>>>> And even if you're right, vhost is not the
>>>>>> only place, there's lots of vmap() based accessing in kernel.
>>>>> For sure. But if one can get by without get user pages, one
>>>>> really should. Witness recently uncovered mess with file
>>>>> backed storage.
>>>> We only pin metadata pages, I don't believe they will be used by any DMA.
>>> It doesn't matter really, if you dirty pages behind the MM back
>>> the problem is there.
>>
>> Ok, but the usual case is anonymous pages, do we use file backed pages for
>> user of vhost?
> Some people use file backed pages for vms.
> Nothing prevents them from using vhost as well.


Ok.


>
>> And even if we use sometime, according to the pointer it's
>> not something that can fix, RFC has been posted to solve this issue.
>>
>> Thanks
> Except it's not broken if we don't to gup + write.
> So yea, wait for rfc to be merged.
>

Yes.

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-26  3:57                 ` Jason Wang
@ 2018-12-26 15:02                   ` Michael S. Tsirkin
  2018-12-27  9:39                     ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-26 15:02 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
> 
> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > > > On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > > > 
> > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > > > access the metadata directly.
> > > > > > > > > 
> > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > > > use similar method to optimize it in the future.
> > > > > > > > > 
> > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > 
> > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > After:  ~6.1Mpps
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >      drivers/vhost/vhost.h |  11 +++
> > > > > > > > >      2 files changed, 189 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > >      		vq->indirect = NULL;
> > > > > > > > >      		vq->heads = NULL;
> > > > > > > > >      		vq->dev = dev;
> > > > > > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > > > >      		mutex_init(&vq->mutex);
> > > > > > > > >      		vhost_vq_reset(dev, vq);
> > > > > > > > >      		if (vq->handle_kick)
> > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > >      	spin_unlock(&dev->iotlb_lock);
> > > > > > > > >      }
> > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > +			   size_t size, int write)
> > > > > > > > > +{
> > > > > > > > > +	struct page **pages;
> > > > > > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > > > +	int npinned;
> > > > > > > > > +	void *vaddr;
> > > > > > > > > +
> > > > > > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > +	if (!pages)
> > > > > > > > > +		return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > +	if (npinned != npages)
> > > > > > > > > +		goto err;
> > > > > > > > > +
> > > > > > > > As I said I have doubts about the whole approach, but this
> > > > > > > > implementation in particular isn't a good idea
> > > > > > > > as it keeps the page around forever.
> > > > > The pages wil be released during set features.
> > > > > 
> > > > > 
> > > > > > > > So no THP, no NUMA rebalancing,
> > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > > > consider the gain we have?
> > > > We as in vhost? networking isn't the only thing guest does.
> > > > We don't even know if this guest does a lot of networking.
> > > > You don't
> > > > know what else is in this huge page. Can be something very important
> > > > that guest touches all the time.
> > > 
> > > Well, the probability should be very small consider we usually give several
> > > gigabytes to guest. The rest of the pages that doesn't sit in the same
> > > hugepage with metadata can still be merged by THP.  Anyway, I can test the
> > > differences.
> > Thanks!
> > 
> > > > > For NUMA rebalancing, I'm even not quite sure if
> > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > > > may cause page to be thrash between nodes if vhost and userspace are running
> > > > > in two nodes.
> > > > So again it's a gain for vhost but has a completely unpredictable effect on
> > > > other functionality of the guest.
> > > > 
> > > > That's what bothers me with this approach.
> > > 
> > > So:
> > > 
> > > - The rest of the pages could still be balanced to other nodes, no?
> > > 
> > > - try to balance metadata pages (belongs to co-operate processes) itself is
> > > still questionable
> > I am not sure why. It should be easy enough to force the VCPU and vhost
> > to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> > Clearly sometimes this would be necessary for load balancing reasons.
> 
> 
> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
> manual pinning.

... of memory. Yes.


> 
> > With autonuma after a while (could take seconds but it will happen) the
> > memory will migrate.
> > 
> 
> Yes. As you mentioned during the discuss, I wonder we could do it similarly
> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
> Unpin and remove kvm_arch->apic_access_page")

That would be a possible approach.

> 
> > 
> > 
> > > > 
> > > > 
> > > > 
> > > > > > > This is the price of all GUP users not only vhost itself.
> > > > > > Yes. GUP is just not a great interface for vhost to use.
> > > > > Zerocopy codes (enabled by defualt) use them for years.
> > > > But only for TX and temporarily. We pin, read, unpin.
> > > 
> > > Probably not. For several reasons that the page will be not be released soon
> > > or held for a very long period of time or even forever.
> > 
> > With zero copy? Well it's pinned until transmit. Takes a while
> > but could be enough for autocopy to work esp since
> > its the packet memory so not reused immediately.
> > 
> > > > Your patch is different
> > > > 
> > > > - it writes into memory and GUP has known issues with file
> > > >     backed memory
> > > 
> > > The ordinary user for vhost is anonymous pages I think?
> > 
> > It's not the most common scenario and not the fastest one
> > (e.g. THP does not work) but file backed is useful sometimes.
> > It would not be nice at all to corrupt guest memory in that case.
> 
> 
> Ok.
> 
> 
> > 
> > > > - it keeps pages pinned forever
> > > > 
> > > > 
> > > > 
> > > > > > > What's more
> > > > > > > important, the goal is not to be left too much behind for other backends
> > > > > > > like DPDK or AF_XDP (all of which are using GUP).
> > > > > > So these guys assume userspace knows what it's doing.
> > > > > > We can't assume that.
> > > > > What kind of assumption do you they have?
> > > > > 
> > > > > 
> > > > > > > > userspace-controlled
> > > > > > > > amount of memory locked up and not accounted for.
> > > > > > > It's pretty easy to add this since the slow path was still kept. If we
> > > > > > > exceeds the limitation, we can switch back to slow path.
> > > > > > > 
> > > > > > > > Don't get me wrong it's a great patch in an ideal world.
> > > > > > > > But then in an ideal world no barriers smap etc are necessary at all.
> > > > > > > Again, this is only for metadata accessing not the data which has been used
> > > > > > > for years for real use cases.
> > > > > > > 
> > > > > > > For SMAP, it makes senses for the address that kernel can not forcast. But
> > > > > > > it's not the case for the vhost metadata since we know the address will be
> > > > > > > accessed very frequently. For speculation barrier, it helps nothing for the
> > > > > > > data path of vhost which is a kthread.
> > > > > > I don't see how a kthread makes any difference. We do have a validation
> > > > > > step which makes some difference.
> > > > > The problem is not kthread but the address of userspace address. The
> > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > > > they will be frequently. SMAP doesn't help too much in this case.
> > > > > 
> > > > > Thanks.
> > > > It's true for a real life applications but a malicious one
> > > > can call the setup ioctls any number of times. And SMAP is
> > > > all about malcious applications.
> > > 
> > > We don't do this in the path of ioctl, there's no context switch between
> > > userspace and kernel in the worker thread. SMAP is used to prevent kernel
> > > from accessing userspace pages unexpectedly which is not the case for
> > > metadata access.
> > > 
> > > Thanks
> > OK let's forget smap for now.
> 
> 
> Some numbers I measured:
> 
> On an old Sandy bridge machine without SMAP support. Remove speculation
> barrier boost the performance from 4.6Mpps to 5.1Mpps
> 
> On a newer Broadwell machine with SMAP support. Remove speculation barrier
> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
> only bypass SMAP for metadata).
> 
> So it looks like for recent machine, SMAP becomes pain point when the copy
> is short (e.g 64B) for high PPS.
> 
> Thanks

Thanks a lot for looking into this!

So first of all users can just boot with nosmap, right?
What's wrong with that? Yes it's not fine-grained but OTOH
it's easy to understand.

And I guess this confirms that if we are going to worry
about smap enabled, we need to look into packet copies
too, not just meta-data.

Vaguely could see a module option (off by default)
where vhost basically does user_access_begin
when it starts running, then uses unsafe accesses
in vhost and tun and then user_access_end.


> 
> > 
> > > > > > > Packet or AF_XDP benefit from
> > > > > > > accessing metadata directly, we should do it as well.
> > > > > > > 
> > > > > > > Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-26 15:02                   ` Michael S. Tsirkin
@ 2018-12-27  9:39                     ` Jason Wang
  2018-12-30 18:30                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2018-12-27  9:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>>>
>>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>>>> access the metadata directly.
>>>>>>>>>>
>>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>>>
>>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>
>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>> After:  ~6.1Mpps
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>       drivers/vhost/vhost.h |  11 +++
>>>>>>>>>>       2 files changed, 189 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>>       		vq->indirect = NULL;
>>>>>>>>>>       		vq->heads = NULL;
>>>>>>>>>>       		vq->dev = dev;
>>>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>>>       		mutex_init(&vq->mutex);
>>>>>>>>>>       		vhost_vq_reset(dev, vq);
>>>>>>>>>>       		if (vq->handle_kick)
>>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>>       	spin_unlock(&dev->iotlb_lock);
>>>>>>>>>>       }
>>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>> +			   size_t size, int write)
>>>>>>>>>> +{
>>>>>>>>>> +	struct page **pages;
>>>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>>>> +	int npinned;
>>>>>>>>>> +	void *vaddr;
>>>>>>>>>> +
>>>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>> +	if (!pages)
>>>>>>>>>> +		return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>>>> +	if (npinned != npages)
>>>>>>>>>> +		goto err;
>>>>>>>>>> +
>>>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>>>> implementation in particular isn't a good idea
>>>>>>>>> as it keeps the page around forever.
>>>>>> The pages wil be released during set features.
>>>>>>
>>>>>>
>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>>>> consider the gain we have?
>>>>> We as in vhost? networking isn't the only thing guest does.
>>>>> We don't even know if this guest does a lot of networking.
>>>>> You don't
>>>>> know what else is in this huge page. Can be something very important
>>>>> that guest touches all the time.
>>>> Well, the probability should be very small consider we usually give several
>>>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>>>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>>>> differences.
>>> Thanks!
>>>
>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>>>> in two nodes.
>>>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>>>> other functionality of the guest.
>>>>>
>>>>> That's what bothers me with this approach.
>>>> So:
>>>>
>>>> - The rest of the pages could still be balanced to other nodes, no?
>>>>
>>>> - try to balance metadata pages (belongs to co-operate processes) itself is
>>>> still questionable
>>> I am not sure why. It should be easy enough to force the VCPU and vhost
>>> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
>>> Clearly sometimes this would be necessary for load balancing reasons.
>>
>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>> manual pinning.
> ... of memory. Yes.
>
>
>>> With autonuma after a while (could take seconds but it will happen) the
>>> memory will migrate.
>>>
>> Yes. As you mentioned during the discuss, I wonder we could do it similarly
>> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
>> Unpin and remove kvm_arch->apic_access_page")
> That would be a possible approach.


Yes, this looks possible, and the conversion seems not hard. Let me have 
a try with this.


[...]


>>>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>>>> step which makes some difference.
>>>>>> The problem is not kthread but the address of userspace address. The
>>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>>>
>>>>>> Thanks.
>>>>> It's true for a real life applications but a malicious one
>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>> all about malcious applications.
>>>> We don't do this in the path of ioctl, there's no context switch between
>>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>>>> from accessing userspace pages unexpectedly which is not the case for
>>>> metadata access.
>>>>
>>>> Thanks
>>> OK let's forget smap for now.
>>
>> Some numbers I measured:
>>
>> On an old Sandy bridge machine without SMAP support. Remove speculation
>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>
>> On a newer Broadwell machine with SMAP support. Remove speculation barrier
>> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
>> only bypass SMAP for metadata).
>>
>> So it looks like for recent machine, SMAP becomes pain point when the copy
>> is short (e.g 64B) for high PPS.
>>
>> Thanks
> Thanks a lot for looking into this!
>
> So first of all users can just boot with nosmap, right?
> What's wrong with that?


Nothing wrong, just realize we had this kernel parameter.


>   Yes it's not fine-grained but OTOH
> it's easy to understand.
>
> And I guess this confirms that if we are going to worry
> about smap enabled, we need to look into packet copies
> too, not just meta-data.


For packet copies, we can do batch copy which is pretty simple for the 
case of XDP. I've already had patches for this.


>
> Vaguely could see a module option (off by default)
> where vhost basically does user_access_begin
> when it starts running, then uses unsafe accesses
> in vhost and tun and then user_access_end.


Using user_access_begin() is more tricky than imaged. E.g it requires:

- userspace address to be validated before through access_ok() [1]

- It doesn't support calling a function that does explicit schedule 
since SMAP/PAN state is not maintained through schedule() [2]

[1] https://lwn.net/Articles/736348/

[2] https://lkml.org/lkml/2018/11/23/430

So calling user_access_begin() all the time when vhost is running seems 
pretty dangerous.

For a better batched datacopy, I tend to build not only XDP but also skb 
in vhost in the future.

Thanks


>
>
>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>> accessing metadata directly, we should do it as well.
>>>>>>>>
>>>>>>>> Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-27  9:39                     ` Jason Wang
@ 2018-12-30 18:30                       ` Michael S. Tsirkin
  2019-01-02 11:38                         ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2018-12-30 18:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel

On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
> > On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
> > > On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > > > On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > > > > > On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > > > > > 
> > > > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > > > > > access the metadata directly.
> > > > > > > > > > > 
> > > > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > > > > > use similar method to optimize it in the future.
> > > > > > > > > > > 
> > > > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > > > 
> > > > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > > > After:  ~6.1Mpps
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >       drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >       drivers/vhost/vhost.h |  11 +++
> > > > > > > > > > >       2 files changed, 189 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > > > >       		vq->indirect = NULL;
> > > > > > > > > > >       		vq->heads = NULL;
> > > > > > > > > > >       		vq->dev = dev;
> > > > > > > > > > > +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > > > > > +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > > > > > +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > > > > > >       		mutex_init(&vq->mutex);
> > > > > > > > > > >       		vhost_vq_reset(dev, vq);
> > > > > > > > > > >       		if (vq->handle_kick)
> > > > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > > > >       	spin_unlock(&dev->iotlb_lock);
> > > > > > > > > > >       }
> > > > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > > > +			   size_t size, int write)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct page **pages;
> > > > > > > > > > > +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > > > > > +	int npinned;
> > > > > > > > > > > +	void *vaddr;
> > > > > > > > > > > +
> > > > > > > > > > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > > > +	if (!pages)
> > > > > > > > > > > +		return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > > > +	if (npinned != npages)
> > > > > > > > > > > +		goto err;
> > > > > > > > > > > +
> > > > > > > > > > As I said I have doubts about the whole approach, but this
> > > > > > > > > > implementation in particular isn't a good idea
> > > > > > > > > > as it keeps the page around forever.
> > > > > > > The pages wil be released during set features.
> > > > > > > 
> > > > > > > 
> > > > > > > > > > So no THP, no NUMA rebalancing,
> > > > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > > > > > consider the gain we have?
> > > > > > We as in vhost? networking isn't the only thing guest does.
> > > > > > We don't even know if this guest does a lot of networking.
> > > > > > You don't
> > > > > > know what else is in this huge page. Can be something very important
> > > > > > that guest touches all the time.
> > > > > Well, the probability should be very small consider we usually give several
> > > > > gigabytes to guest. The rest of the pages that doesn't sit in the same
> > > > > hugepage with metadata can still be merged by THP.  Anyway, I can test the
> > > > > differences.
> > > > Thanks!
> > > > 
> > > > > > > For NUMA rebalancing, I'm even not quite sure if
> > > > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > > > > > may cause page to be thrash between nodes if vhost and userspace are running
> > > > > > > in two nodes.
> > > > > > So again it's a gain for vhost but has a completely unpredictable effect on
> > > > > > other functionality of the guest.
> > > > > > 
> > > > > > That's what bothers me with this approach.
> > > > > So:
> > > > > 
> > > > > - The rest of the pages could still be balanced to other nodes, no?
> > > > > 
> > > > > - try to balance metadata pages (belongs to co-operate processes) itself is
> > > > > still questionable
> > > > I am not sure why. It should be easy enough to force the VCPU and vhost
> > > > to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> > > > Clearly sometimes this would be necessary for load balancing reasons.
> > > 
> > > Yes, but it looks to me the part of motivation of auto NUMA is to avoid
> > > manual pinning.
> > ... of memory. Yes.
> > 
> > 
> > > > With autonuma after a while (could take seconds but it will happen) the
> > > > memory will migrate.
> > > > 
> > > Yes. As you mentioned during the discuss, I wonder we could do it similarly
> > > through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
> > > Unpin and remove kvm_arch->apic_access_page")
> > That would be a possible approach.
> 
> 
> Yes, this looks possible, and the conversion seems not hard. Let me have a
> try with this.
> 
> 
> [...]
> 
> 
> > > > > > > > I don't see how a kthread makes any difference. We do have a validation
> > > > > > > > step which makes some difference.
> > > > > > > The problem is not kthread but the address of userspace address. The
> > > > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > > > > > they will be frequently. SMAP doesn't help too much in this case.
> > > > > > > 
> > > > > > > Thanks.
> > > > > > It's true for a real life applications but a malicious one
> > > > > > can call the setup ioctls any number of times. And SMAP is
> > > > > > all about malcious applications.
> > > > > We don't do this in the path of ioctl, there's no context switch between
> > > > > userspace and kernel in the worker thread. SMAP is used to prevent kernel
> > > > > from accessing userspace pages unexpectedly which is not the case for
> > > > > metadata access.
> > > > > 
> > > > > Thanks
> > > > OK let's forget smap for now.
> > > 
> > > Some numbers I measured:
> > > 
> > > On an old Sandy bridge machine without SMAP support. Remove speculation
> > > barrier boost the performance from 4.6Mpps to 5.1Mpps
> > > 
> > > On a newer Broadwell machine with SMAP support. Remove speculation barrier
> > > only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
> > > 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
> > > only bypass SMAP for metadata).
> > > 
> > > So it looks like for recent machine, SMAP becomes pain point when the copy
> > > is short (e.g 64B) for high PPS.
> > > 
> > > Thanks
> > Thanks a lot for looking into this!
> > 
> > So first of all users can just boot with nosmap, right?
> > What's wrong with that?
> 
> 
> Nothing wrong, just realize we had this kernel parameter.
> 
> 
> >   Yes it's not fine-grained but OTOH
> > it's easy to understand.
> > 
> > And I guess this confirms that if we are going to worry
> > about smap enabled, we need to look into packet copies
> > too, not just meta-data.
> 
> 
> For packet copies, we can do batch copy which is pretty simple for the case
> of XDP. I've already had patches for this.
> 
> 
> > 
> > Vaguely could see a module option (off by default)
> > where vhost basically does user_access_begin
> > when it starts running, then uses unsafe accesses
> > in vhost and tun and then user_access_end.
> 
> 
> Using user_access_begin() is more tricky than imaged. E.g it requires:
> 
> - userspace address to be validated before through access_ok() [1]

This part is fine I think - addresses come from the memory
map and when userspace supplies the memory map
we validate everything with access_ok.
Well do we validate with the iotlb too? Don't see it right now
so maybe not but it's easy to add.

> - It doesn't support calling a function that does explicit schedule since
> SMAP/PAN state is not maintained through schedule() [2]
> 
> [1] https://lwn.net/Articles/736348/
> 
> [2] https://lkml.org/lkml/2018/11/23/430
> 
> So calling user_access_begin() all the time when vhost is running seems
> pretty dangerous.

Yes it requires some rework e.g. to try getting memory with
GFP_ATOMIC. We could then do a slow path with GFP_KERNEL
if that fails.

> For a better batched datacopy, I tend to build not only XDP but also skb in
> vhost in the future.
> 
> Thanks

Sure, why not.

> 
> > 
> > 
> > > > > > > > > Packet or AF_XDP benefit from
> > > > > > > > > accessing metadata directly, we should do it as well.
> > > > > > > > > 
> > > > > > > > > Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
  2018-12-30 18:30                       ` Michael S. Tsirkin
@ 2019-01-02 11:38                         ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2019-01-02 11:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel


On 2018/12/31 上午2:30, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang wrote:
>> On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
>>> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>>>> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
>>>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>>>>>> access the metadata directly.
>>>>>>>>>>>>
>>>>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>>>
>>>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>>>> After:  ~6.1Mpps
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>        drivers/vhost/vhost.h |  11 +++
>>>>>>>>>>>>        2 files changed, 189 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>>>>        		vq->indirect = NULL;
>>>>>>>>>>>>        		vq->heads = NULL;
>>>>>>>>>>>>        		vq->dev = dev;
>>>>>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>>>>>        		mutex_init(&vq->mutex);
>>>>>>>>>>>>        		vhost_vq_reset(dev, vq);
>>>>>>>>>>>>        		if (vq->handle_kick)
>>>>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>>>>        	spin_unlock(&dev->iotlb_lock);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>>>> +			   size_t size, int write)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	struct page **pages;
>>>>>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>>>>>> +	int npinned;
>>>>>>>>>>>> +	void *vaddr;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>>>> +	if (!pages)
>>>>>>>>>>>> +		return -ENOMEM;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>>>>>> +	if (npinned != npages)
>>>>>>>>>>>> +		goto err;
>>>>>>>>>>>> +
>>>>>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>>>>>> implementation in particular isn't a good idea
>>>>>>>>>>> as it keeps the page around forever.
>>>>>>>> The pages wil be released during set features.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>>>>>> consider the gain we have?
>>>>>>> We as in vhost? networking isn't the only thing guest does.
>>>>>>> We don't even know if this guest does a lot of networking.
>>>>>>> You don't
>>>>>>> know what else is in this huge page. Can be something very important
>>>>>>> that guest touches all the time.
>>>>>> Well, the probability should be very small consider we usually give several
>>>>>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>>>>>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>>>>>> differences.
>>>>> Thanks!
>>>>>
>>>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>>>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>>>>>> in two nodes.
>>>>>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>>>>>> other functionality of the guest.
>>>>>>>
>>>>>>> That's what bothers me with this approach.
>>>>>> So:
>>>>>>
>>>>>> - The rest of the pages could still be balanced to other nodes, no?
>>>>>>
>>>>>> - try to balance metadata pages (belongs to co-operate processes) itself is
>>>>>> still questionable
>>>>> I am not sure why. It should be easy enough to force the VCPU and vhost
>>>>> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
>>>>> Clearly sometimes this would be necessary for load balancing reasons.
>>>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>>>> manual pinning.
>>> ... of memory. Yes.
>>>
>>>
>>>>> With autonuma after a while (could take seconds but it will happen) the
>>>>> memory will migrate.
>>>>>
>>>> Yes. As you mentioned during the discuss, I wonder we could do it similarly
>>>> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
>>>> Unpin and remove kvm_arch->apic_access_page")
>>> That would be a possible approach.
>>
>> Yes, this looks possible, and the conversion seems not hard. Let me have a
>> try with this.
>>
>>
>> [...]
>>
>>
>>>>>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>>>>>> step which makes some difference.
>>>>>>>> The problem is not kthread but the address of userspace address. The
>>>>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>>>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>> It's true for a real life applications but a malicious one
>>>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>>>> all about malcious applications.
>>>>>> We don't do this in the path of ioctl, there's no context switch between
>>>>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>>>>>> from accessing userspace pages unexpectedly which is not the case for
>>>>>> metadata access.
>>>>>>
>>>>>> Thanks
>>>>> OK let's forget smap for now.
>>>> Some numbers I measured:
>>>>
>>>> On an old Sandy bridge machine without SMAP support. Remove speculation
>>>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>>>
>>>> On a newer Broadwell machine with SMAP support. Remove speculation barrier
>>>> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
>>>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
>>>> only bypass SMAP for metadata).
>>>>
>>>> So it looks like for recent machine, SMAP becomes pain point when the copy
>>>> is short (e.g 64B) for high PPS.
>>>>
>>>> Thanks
>>> Thanks a lot for looking into this!
>>>
>>> So first of all users can just boot with nosmap, right?
>>> What's wrong with that?
>>
>> Nothing wrong, just realize we had this kernel parameter.
>>
>>
>>>    Yes it's not fine-grained but OTOH
>>> it's easy to understand.
>>>
>>> And I guess this confirms that if we are going to worry
>>> about smap enabled, we need to look into packet copies
>>> too, not just meta-data.
>>
>> For packet copies, we can do batch copy which is pretty simple for the case
>> of XDP. I've already had patches for this.
>>
>>
>>> Vaguely could see a module option (off by default)
>>> where vhost basically does user_access_begin
>>> when it starts running, then uses unsafe accesses
>>> in vhost and tun and then user_access_end.
>>
>> Using user_access_begin() is more tricky than imaged. E.g it requires:
>>
>> - userspace address to be validated before through access_ok() [1]
> This part is fine I think - addresses come from the memory
> map and when userspace supplies the memory map
> we validate everything with access_ok.
> Well do we validate with the iotlb too? Don't see it right now
> so maybe not but it's easy to add.


Yes, it's not hard.


>
>> - It doesn't support calling a function that does explicit schedule since
>> SMAP/PAN state is not maintained through schedule() [2]
>>
>> [1] https://lwn.net/Articles/736348/
>>
>> [2] https://lkml.org/lkml/2018/11/23/430
>>
>> So calling user_access_begin() all the time when vhost is running seems
>> pretty dangerous.
> Yes it requires some rework e.g. to try getting memory with
> GFP_ATOMIC. We could then do a slow path with GFP_KERNEL
> if that fails.


I'm not sure this is the only part that needs care. Consider all the 
under layer network or block codes assumes a process context, it's not 
easy to figure out all I'm afraid. And even if we could, it's hard to 
prevent it from being added in the future.

Thanks


>
>> For a better batched datacopy, I tend to build not only XDP but also skb in
>> vhost in the future.
>>
>> Thanks
> Sure, why not.
>
>>>
>>>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>>>> accessing metadata directly, we should do it as well.
>>>>>>>>>>
>>>>>>>>>> Thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2019-01-02 11:38 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-13 10:10 [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Jason Wang
2018-12-13 10:10 ` [PATCH net-next 1/3] vhost: generalize adding used elem Jason Wang
2018-12-13 19:41   ` Michael S. Tsirkin
2018-12-14  4:00     ` Jason Wang
2018-12-13 10:10 ` [PATCH net-next 2/3] vhost: fine grain userspace memory accessors Jason Wang
2018-12-13 10:10 ` [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address Jason Wang
2018-12-13 15:44   ` Michael S. Tsirkin
2018-12-13 21:18     ` Konrad Rzeszutek Wilk
2018-12-13 21:58       ` Michael S. Tsirkin
2018-12-14  3:57     ` Jason Wang
2018-12-14 12:36       ` Michael S. Tsirkin
2018-12-24  7:53         ` Jason Wang
2018-12-24 18:10           ` Michael S. Tsirkin
2018-12-25 10:05             ` Jason Wang
2018-12-25 12:50               ` Michael S. Tsirkin
2018-12-26  3:57                 ` Jason Wang
2018-12-26 15:02                   ` Michael S. Tsirkin
2018-12-27  9:39                     ` Jason Wang
2018-12-30 18:30                       ` Michael S. Tsirkin
2019-01-02 11:38                         ` Jason Wang
2018-12-15 21:15       ` David Miller
2018-12-14 14:48   ` kbuild test robot
2018-12-13 15:27 ` [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() Michael S. Tsirkin
2018-12-14  3:42   ` Jason Wang
2018-12-14 12:33     ` Michael S. Tsirkin
2018-12-14 15:31       ` Michael S. Tsirkin
2018-12-24  8:32       ` Jason Wang
2018-12-24 18:12         ` Michael S. Tsirkin
2018-12-25 10:09           ` Jason Wang
2018-12-25 12:52             ` Michael S. Tsirkin
2018-12-26  3:59               ` Jason Wang
2018-12-13 20:12 ` Michael S. Tsirkin
2018-12-14  4:29   ` Jason Wang
2018-12-14 12:52     ` Michael S. Tsirkin
2018-12-15 19:43     ` David Miller
2018-12-16 19:57       ` Michael S. Tsirkin
2018-12-24  8:44         ` Jason Wang
2018-12-24 19:09           ` Michael S. Tsirkin
2018-12-14 15:16 ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).