linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH V3 0/5] Hi:
@ 2018-12-29 12:46 Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 1/5] vhost: generalize adding used elem Jason Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Jason Wang @ 2018-12-29 12:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel; +Cc: davem

This series tries to access virtqueue metadata through kernel virtual
addresses instead of the copy_user() friends, since the latter carry
too much overhead: software checks, speculation barriers and even
hardware feature toggling (e.g. SMAP).

Tests show about a 24% improvement in TX PPS. It should benefit other
cases as well.
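
The core idea, in simplified form (the helper below is condensed from
patch 5 of this series; comments added here for illustration), is that
each metadata accessor gets a fast path which dereferences a vmap()ed
kernel alias of the ring, and falls back to the existing copy_user()
path when no mapping is present or when a device IOTLB is in use:

static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
				      __virtio16 *idx)
{
	if (!vq->iotlb) {
		struct vring_avail *avail = vq->avail_ring.addr;

		/* Fast path: vmap is still valid, plain load. */
		if (likely(avail)) {
			*idx = avail->idx;
			return 0;
		}
	}

	/* Slow path: fall back to the copy_user() based accessor. */
	return vhost_get_avail(vq, *idx, &vq->avail->idx);
}

The vmaps are established lazily in vq_meta_prefetch() and torn down
from an MMU notifier when the pages are invalidated, so the pages stay
migratable.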

Changes from V2:
- fix buggy range overlapping check
- tear down the MMU notifier during vhost ioctl to make sure the
  invalidation callback can read the metadata userspace addresses and
  vq size without holding the vq mutex.
Changes from V1:
- instead of pinning pages, use an MMU notifier to invalidate vmaps and
  remap during metadata prefetch
- fix build warning on MIPS

Jason Wang (5):
  vhost: generalize adding used elem
  vhost: fine grain userspace memory accessors
  vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
  vhost: introduce helpers to get the size of metadata area
  vhost: access vq metadata through kernel virtual address

 drivers/vhost/net.c   |   4 +-
 drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
 drivers/vhost/vhost.h |  15 +-
 3 files changed, 384 insertions(+), 51 deletions(-)

-- 
2.17.1



* [RFC PATCH V3 1/5] vhost: generalize adding used elem
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
@ 2018-12-29 12:46 ` Jason Wang
  2019-01-04 21:29   ` Michael S. Tsirkin
  2018-12-29 12:46 ` [RFC PATCH V3 2/5] vhost: fine grain userspace memory accessors Jason Wang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2018-12-29 12:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel; +Cc: davem

Use one generic vhost_copy_to_user() instead of two dedicated
accessors. This will simplify the conversion to fine-grained
accessors. About a 2% PPS improvement was seen during a virtio-user
txonly test.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 55e5aa662ad5..f179b5ee14c4 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2174,16 +2174,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
-	if (count == 1) {
-		if (vhost_put_user(vq, heads[0].id, &used->id)) {
-			vq_err(vq, "Failed to write used id");
-			return -EFAULT;
-		}
-		if (vhost_put_user(vq, heads[0].len, &used->len)) {
-			vq_err(vq, "Failed to write used len");
-			return -EFAULT;
-		}
-	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
-- 
2.17.1



* [RFC PATCH V3 2/5] vhost: fine grain userspace memory accessors
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 1/5] vhost: generalize adding used elem Jason Wang
@ 2018-12-29 12:46 ` Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch() Jason Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2018-12-29 12:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel; +Cc: davem

This is used to hide the metadata addresses from the virtqueue
helpers. This will allow implementing vmap-based fast access to the
metadata.
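
Concretely (condensed from the diff below), each helper is for now a
thin wrapper around the existing accessor macros, e.g.:

static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
{
	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
			      &vq->used->idx);
}

so callers stop referencing vq->used->idx and friends directly, and a
later patch can add a vmap-based fast path in a single place.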

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 77 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f179b5ee14c4..337ce6f5a098 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -868,6 +868,34 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 	ret; \
 })
 
+static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+			      vhost_avail_event(vq));
+}
+
+static inline int vhost_put_used(struct vhost_virtqueue *vq,
+				 struct vring_used_elem *head, int idx,
+				 int count)
+{
+	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
+				  count * sizeof(*head));
+}
+
+static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
+
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+			      &vq->used->flags);
+}
+
+static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
+
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+			      &vq->used->idx);
+}
+
 #define vhost_get_user(vq, x, ptr, type)		\
 ({ \
 	int ret; \
@@ -906,6 +934,43 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
 		mutex_unlock(&d->vqs[i]->mutex);
 }
 
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
+				      __virtio16 *idx)
+{
+	return vhost_get_avail(vq, *idx, &vq->avail->idx);
+}
+
+static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
+				       __virtio16 *head, int idx)
+{
+	return vhost_get_avail(vq, *head,
+			       &vq->avail->ring[idx & (vq->num - 1)]);
+}
+
+static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
+					__virtio16 *flags)
+{
+	return vhost_get_avail(vq, *flags, &vq->avail->flags);
+}
+
+static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
+				       __virtio16 *event)
+{
+	return vhost_get_avail(vq, *event, vhost_used_event(vq));
+}
+
+static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
+				     __virtio16 *idx)
+{
+	return vhost_get_used(vq, *idx, &vq->used->idx);
+}
+
+static inline int vhost_get_desc(struct vhost_virtqueue *vq,
+				 struct vring_desc *desc, int idx)
+{
+	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
+}
+
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -1761,8 +1826,7 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
 	void __user *used;
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
-			   &vq->used->flags) < 0)
+	if (vhost_put_used_flags(vq))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		/* Make sure the flag is seen before log. */
@@ -1780,8 +1844,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
 {
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
-			   vhost_avail_event(vq)))
+	if (vhost_put_avail_event(vq))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		void __user *used;
@@ -1818,7 +1881,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
+	r = vhost_get_used_idx(vq, &last_used_idx);
 	if (r) {
 		vq_err(vq, "Can't access used idx at %p\n",
 		       &vq->used->idx);
@@ -2017,7 +2080,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	last_avail_idx = vq->last_avail_idx;
 
 	if (vq->avail_idx == vq->last_avail_idx) {
-		if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
 			vq_err(vq, "Failed to access avail idx at %p\n",
 				&vq->avail->idx);
 			return -EFAULT;
@@ -2044,8 +2107,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(vhost_get_avail(vq, ring_head,
-		     &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
 		       &vq->avail->ring[last_avail_idx % vq->num]);
@@ -2080,8 +2142,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
-					   sizeof desc);
+		ret = vhost_get_desc(vq, &desc, i);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -2174,7 +2235,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
-	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+	if (vhost_put_used(vq, heads, start, count)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
@@ -2218,8 +2279,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 
 	/* Make sure buffer is written before we update index. */
 	smp_wmb();
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
-			   &vq->used->idx)) {
+	if (vhost_put_used_idx(vq)) {
 		vq_err(vq, "Failed to increment used idx");
 		return -EFAULT;
 	}
@@ -2253,7 +2313,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
+		if (vhost_get_avail_flags(vq, &flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -2267,7 +2327,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
+	if (vhost_get_used_event(vq, &event)) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -2312,7 +2372,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (vq->avail_idx != vq->last_avail_idx)
 		return false;
 
-	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (unlikely(r))
 		return false;
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
@@ -2348,7 +2408,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
-- 
2.17.1



* [RFC PATCH V3 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 1/5] vhost: generalize adding used elem Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 2/5] vhost: fine grain userspace memory accessors Jason Wang
@ 2018-12-29 12:46 ` Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 4/5] vhost: introduce helpers to get the size of metadata area Jason Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2018-12-29 12:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel; +Cc: davem

Rename the function to be more accurate, since what it actually does
is prefetch the vq metadata addresses in the IOTLB. It will also be
used by a following patch to prefetch the metadata virtual addresses.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   | 4 ++--
 drivers/vhost/vhost.c | 4 ++--
 drivers/vhost/vhost.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 36f3d0f49e60..0b4b3deab5aa 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -971,7 +971,7 @@ static void handle_tx(struct vhost_net *net)
 	if (!sock)
 		goto out;
 
-	if (!vq_iotlb_prefetch(vq))
+	if (!vq_meta_prefetch(vq))
 		goto out;
 
 	vhost_disable_notify(&net->dev, vq);
@@ -1140,7 +1140,7 @@ static void handle_rx(struct vhost_net *net)
 	if (!sock)
 		goto out;
 
-	if (!vq_iotlb_prefetch(vq))
+	if (!vq_meta_prefetch(vq))
 		goto out;
 
 	vhost_disable_notify(&net->dev, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 337ce6f5a098..27b5c03feaac 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1304,7 +1304,7 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 	return true;
 }
 
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_meta_prefetch(struct vhost_virtqueue *vq)
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 	unsigned int num = vq->num;
@@ -1323,7 +1323,7 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
 			       num * sizeof(*vq->used->ring) + s,
 			       VHOST_ADDR_USED);
 }
-EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
+EXPORT_SYMBOL_GPL(vq_meta_prefetch);
 
 /* Can we log writes? */
 /* Caller should have device mutex but not vq mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 466ef7542291..0d1ff977a43e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -206,7 +206,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
+int vq_meta_prefetch(struct vhost_virtqueue *vq);
 
 struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
 void vhost_enqueue_msg(struct vhost_dev *dev,
-- 
2.17.1



* [RFC PATCH V3 4/5] vhost: introduce helpers to get the size of metadata area
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
                   ` (2 preceding siblings ...)
  2018-12-29 12:46 ` [RFC PATCH V3 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch() Jason Wang
@ 2018-12-29 12:46 ` Jason Wang
  2018-12-29 12:46 ` [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address Jason Wang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2018-12-29 12:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel; +Cc: davem

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 46 ++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 27b5c03feaac..54b43feef8d9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -489,6 +489,27 @@ bool vhost_dev_has_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
 
+static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)
+{
+	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
+	return sizeof(*vq->avail) +
+	       sizeof(*vq->avail->ring) * num + event;
+}
+
+static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
+{
+	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
+	return sizeof(*vq->used) +
+	       sizeof(*vq->used->ring) * num + event;
+}
+
+static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
+{
+	return sizeof(*vq->desc) * num;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -1248,13 +1269,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			 struct vring_used __user *used)
 
 {
-	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
-
-	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
-	       access_ok(VERIFY_READ, avail,
-			 sizeof *avail + num * sizeof *avail->ring + s) &&
-	       access_ok(VERIFY_WRITE, used,
-			sizeof *used + num * sizeof *used->ring + s);
+	return access_ok(VERIFY_READ, desc, vhost_get_desc_size(vq, num)) &&
+	       access_ok(VERIFY_READ, avail, vhost_get_avail_size(vq, num)) &&
+	       access_ok(VERIFY_WRITE, used, vhost_get_used_size(vq, num));
 }
 
 static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
@@ -1306,22 +1323,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 
 int vq_meta_prefetch(struct vhost_virtqueue *vq)
 {
-	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 	unsigned int num = vq->num;
 
 	if (!vq->iotlb)
 		return 1;
 
 	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
-			       num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
 	       iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
-			       sizeof *vq->avail +
-			       num * sizeof(*vq->avail->ring) + s,
+			       vhost_get_avail_size(vq, num),
 			       VHOST_ADDR_AVAIL) &&
 	       iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
-			       sizeof *vq->used +
-			       num * sizeof(*vq->used->ring) + s,
-			       VHOST_ADDR_USED);
+			       vhost_get_used_size(vq, num), VHOST_ADDR_USED);
 }
 EXPORT_SYMBOL_GPL(vq_meta_prefetch);
 
@@ -1338,13 +1351,10 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
 static bool vq_log_access_ok(struct vhost_virtqueue *vq,
 			     void __user *log_base)
 {
-	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
-
 	return vq_memory_access_ok(log_base, vq->umem,
 				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
-					sizeof *vq->used +
-					vq->num * sizeof *vq->used->ring + s));
+				  vhost_get_used_size(vq, vq->num)));
 }
 
 /* Can we start vq? */
-- 
2.17.1



* [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
                   ` (3 preceding siblings ...)
  2018-12-29 12:46 ` [RFC PATCH V3 4/5] vhost: introduce helpers to get the size of metadata area Jason Wang
@ 2018-12-29 12:46 ` Jason Wang
  2019-01-04 21:34   ` Michael S. Tsirkin
  2019-01-02 20:47 ` [RFC PATCH V3 0/5] Hi: Michael S. Tsirkin
  2019-01-04 21:41 ` Michael S. Tsirkin
  6 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2018-12-29 12:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel; +Cc: davem

It was noticed that the copy_user() friends used to access virtqueue
metadata tend to be very expensive for a dataplane implementation like
vhost, since they involve lots of software checks, speculation
barriers and hardware feature toggling (e.g. SMAP). The extra cost is
more obvious when transferring small packets, since the time spent on
metadata access becomes significant.

This patch tries to eliminate that overhead by accessing the metadata
through kernel virtual addresses obtained via vmap(). To keep the
pages migratable, instead of pinning them through GUP, we use MMU
notifiers to invalidate the vmaps and re-establish them during each
round of metadata prefetching when necessary. For devices that don't
use metadata prefetching, the memory accessors gracefully fall back to
the normal copy_user() implementation. The invalidation is
synchronized with the datapath through the vq mutex, and in order to
avoid holding the vq mutex during range checking, the MMU notifier is
torn down when vq metadata is being modified.

Note that this is only done when the device IOTLB is not enabled. We
could use a similar method to optimize that case in the future.

Tests show about a 24% improvement in TX PPS when using virtio-user +
vhost_net + xdp1 on TAP:

Before: ~5.0Mpps
After:  ~6.1Mpps

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 263 +++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h |  13 +++
 2 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 54b43feef8d9..e1ecb8acf8a3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->indirect = NULL;
 		vq->heads = NULL;
 		vq->dev = dev;
+		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
+		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
+		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
@@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
 	return sizeof(*vq->desc) * num;
 }
 
+static void vhost_uninit_vmap(struct vhost_vmap *map)
+{
+	if (map->addr)
+		vunmap(map->unmap_addr);
+
+	map->addr = NULL;
+	map->unmap_addr = NULL;
+}
+
+static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
+				 struct vhost_vmap *map,
+				 unsigned long ustart,
+				 size_t size,
+				 unsigned long start,
+				 unsigned long end,
+				 bool blockable)
+{
+	if (end < ustart || start > ustart - 1 + size)
+		return 0;
+
+	if (!blockable)
+		return -EAGAIN;
+
+	mutex_lock(&vq->mutex);
+	vhost_uninit_vmap(map);
+	mutex_unlock(&vq->mutex);
+
+	return 0;
+}
+
+static int vhost_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						     struct mm_struct *mm,
+						     unsigned long start,
+						     unsigned long end,
+						     bool blockable)
+{
+	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
+					     mmu_notifier);
+	int i;
+
+	for (i = 0; i < dev->nvqs; i++) {
+		struct vhost_virtqueue *vq = dev->vqs[i];
+
+		if (vhost_invalidate_vmap(vq, &vq->avail_ring,
+					  (unsigned long)vq->avail,
+					  vhost_get_avail_size(vq, vq->num),
+					  start, end, blockable))
+			return -EAGAIN;
+		if (vhost_invalidate_vmap(vq, &vq->desc_ring,
+					  (unsigned long)vq->desc,
+					  vhost_get_desc_size(vq, vq->num),
+					  start, end, blockable))
+			return -EAGAIN;
+		if (vhost_invalidate_vmap(vq, &vq->used_ring,
+					  (unsigned long)vq->used,
+					  vhost_get_used_size(vq, vq->num),
+					  start, end, blockable))
+			return -EAGAIN;
+	}
+
+	return 0;
+}
+
+static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
+	.invalidate_range_start = vhost_mmu_notifier_invalidate_range_start,
+};
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -541,7 +611,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	if (err)
 		goto err_cgroup;
 
+	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
+	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
+	if (err)
+		goto err_mmu_notifier;
+
 	return 0;
+err_mmu_notifier:
+	vhost_dev_free_iovecs(dev);
 err_cgroup:
 	kthread_stop(worker);
 	dev->worker = NULL;
@@ -632,6 +709,72 @@ static void vhost_clear_msg(struct vhost_dev *dev)
 	spin_unlock(&dev->iotlb_lock);
 }
 
+static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
+			   size_t size, int write)
+{
+	struct page **pages;
+	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
+	int npinned;
+	void *vaddr;
+	int err = 0;
+
+	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	npinned = get_user_pages_fast(uaddr, npages, write, pages);
+	if (npinned != npages) {
+		err = -EFAULT;
+		goto err;
+	}
+
+	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+	if (!vaddr) {
+		err = -EFAULT;
+		goto err;
+	}
+
+	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
+	map->unmap_addr = vaddr;
+
+err:
+	/* Don't pin pages, mmu notifier will notify us about page
+	 * migration.
+	 */
+	if (npinned > 0)
+		release_pages(pages, npinned);
+	kfree(pages);
+	return err;
+}
+
+static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
+{
+	vhost_uninit_vmap(&vq->avail_ring);
+	vhost_uninit_vmap(&vq->desc_ring);
+	vhost_uninit_vmap(&vq->used_ring);
+}
+
+static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
+				  unsigned long avail)
+{
+	return vhost_init_vmap(&vq->avail_ring, avail,
+			       vhost_get_avail_size(vq, vq->num), false);
+}
+
+static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
+				 unsigned long desc)
+{
+	return vhost_init_vmap(&vq->desc_ring, desc,
+			       vhost_get_desc_size(vq, vq->num), false);
+}
+
+static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
+				 unsigned long used)
+{
+	return vhost_init_vmap(&vq->used_ring, used,
+			       vhost_get_used_size(vq, vq->num), true);
+}
+
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
@@ -661,8 +804,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 		kthread_stop(dev->worker);
 		dev->worker = NULL;
 	}
-	if (dev->mm)
+	if (dev->mm) {
+		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
 		mmput(dev->mm);
+	}
+	for (i = 0; i < dev->nvqs; i++)
+		vhost_clean_vmaps(dev->vqs[i]);
 	dev->mm = NULL;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
@@ -891,6 +1038,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 
 static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			*((__virtio16 *)&used->ring[vq->num]) =
+				cpu_to_vhost16(vq, vq->avail_idx);
+			return 0;
+		}
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
 			      vhost_avail_event(vq));
 }
@@ -899,6 +1056,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 				 struct vring_used_elem *head, int idx,
 				 int count)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			memcpy(used->ring + idx, head,
+			       count * sizeof(*head));
+			return 0;
+		}
+	}
+
 	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
 				  count * sizeof(*head));
 }
@@ -906,6 +1073,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			used->flags = cpu_to_vhost16(vq, vq->used_flags);
+			return 0;
+		}
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
 			      &vq->used->flags);
 }
@@ -913,6 +1089,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
+			return 0;
+		}
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
 			      &vq->used->idx);
 }
@@ -958,12 +1143,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
 static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 				      __virtio16 *idx)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*idx = avail->idx;
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *idx, &vq->avail->idx);
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 				       __virtio16 *head, int idx)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*head = avail->ring[idx & (vq->num - 1)];
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *head,
 			       &vq->avail->ring[idx & (vq->num - 1)]);
 }
@@ -971,24 +1174,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 					__virtio16 *flags)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*flags = avail->flags;
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *flags, &vq->avail->flags);
 }
 
 static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 				       __virtio16 *event)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*event = (__virtio16)avail->ring[vq->num];
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *event, vhost_used_event(vq));
 }
 
 static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 				     __virtio16 *idx)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			*idx = used->idx;
+			return 0;
+		}
+	}
+
 	return vhost_get_used(vq, *idx, &vq->used->idx);
 }
 
 static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 				 struct vring_desc *desc, int idx)
 {
+	if (!vq->iotlb) {
+		struct vring_desc *d = vq->desc_ring.addr;
+
+		if (likely(d)) {
+			*desc = *(d + idx);
+			return 0;
+		}
+	}
+
 	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
 }
 
@@ -1325,8 +1564,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
 {
 	unsigned int num = vq->num;
 
-	if (!vq->iotlb)
+	if (!vq->iotlb) {
+		if (unlikely(!vq->avail_ring.addr))
+			vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
+		if (unlikely(!vq->desc_ring.addr))
+			vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
+		if (unlikely(!vq->used_ring.addr))
+			vhost_setup_used_vmap(vq, (unsigned long)vq->used);
+
 		return 1;
+	}
 
 	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
 			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
@@ -1478,6 +1725,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 
 	mutex_lock(&vq->mutex);
 
+	/* Unregister the MMU notifier so that the invalidation callback
+	 * can access vq->avail, vq->desc, vq->used and vq->num
+	 * without holding vq->mutex.
+	 */
+	if (d->mm)
+		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
+
 	switch (ioctl) {
 	case VHOST_SET_VRING_NUM:
 		/* Resizing ring with an active backend?
@@ -1494,6 +1748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 			r = -EINVAL;
 			break;
 		}
+		vhost_clean_vmaps(vq);
 		vq->num = s.num;
 		break;
 	case VHOST_SET_VRING_BASE:
@@ -1571,6 +1826,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 			}
 		}
 
+		vhost_clean_vmaps(vq);
+
 		vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
 		vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
 		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
@@ -1651,6 +1908,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 	if (pollstart && vq->handle_kick)
 		r = vhost_poll_start(&vq->poll, vq->kick);
 
+	if (d->mm)
+		mmu_notifier_register(&d->mmu_notifier, d->mm);
 	mutex_unlock(&vq->mutex);
 
 	if (pollstop && vq->handle_kick)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0d1ff977a43e..00f016a4f198 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,8 @@
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
+#include <linux/pagemap.h>
+#include <linux/mmu_notifier.h>
 
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -80,6 +82,11 @@ enum vhost_uaddr_type {
 	VHOST_NUM_ADDRS = 3,
 };
 
+struct vhost_vmap {
+	void *addr;
+	void *unmap_addr;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -90,6 +97,11 @@ struct vhost_virtqueue {
 	struct vring_desc __user *desc;
 	struct vring_avail __user *avail;
 	struct vring_used __user *used;
+
+	struct vhost_vmap avail_ring;
+	struct vhost_vmap desc_ring;
+	struct vhost_vmap used_ring;
+
 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
 	struct file *kick;
 	struct eventfd_ctx *call_ctx;
@@ -158,6 +170,7 @@ struct vhost_msg_node {
 
 struct vhost_dev {
 	struct mm_struct *mm;
+	struct mmu_notifier mmu_notifier;
 	struct mutex mutex;
 	struct vhost_virtqueue **vqs;
 	int nvqs;
-- 
2.17.1



* Re: [RFC PATCH V3 0/5] Hi:
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
                   ` (4 preceding siblings ...)
  2018-12-29 12:46 ` [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address Jason Wang
@ 2019-01-02 20:47 ` Michael S. Tsirkin
  2019-01-07  2:19   ` Jason Wang
  2019-01-04 21:41 ` Michael S. Tsirkin
  6 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-02 20:47 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem

On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.

Will review, thanks!
One question that comes to mind is whether it's all about bypassing
stac/clac.  Could you please include a performance comparison with
nosmap?

> 
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
>
> Changes from V2:
> - fix buggy range overlapping check
> - tear down MMU notifier during vhost ioctl to make sure invalidation
>   request can read metadata userspace address and vq size without
>   holding vq mutex.
> Changes from V1:
> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>   remap duing metadata prefetch
> - fix build warning on MIPS
> 
> Jason Wang (5):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>   vhost: introduce helpers to get the size of metadata area
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/net.c   |   4 +-
>  drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>  drivers/vhost/vhost.h |  15 +-
>  3 files changed, 384 insertions(+), 51 deletions(-)
> 
> -- 
> 2.17.1


* Re: [RFC PATCH V3 1/5] vhost: generalize adding used elem
  2018-12-29 12:46 ` [RFC PATCH V3 1/5] vhost: generalize adding used elem Jason Wang
@ 2019-01-04 21:29   ` Michael S. Tsirkin
  2019-01-05  0:33     ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-04 21:29 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem

On Sat, Dec 29, 2018 at 08:46:52PM +0800, Jason Wang wrote:
> Use one generic vhost_copy_to_user() instead of two dedicated
> accessor. This will simplify the conversion to fine grain
> accessors. About 2% improvement of PPS were seen during vitio-user
> txonly test.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

I don't have a problem with this patch but do you have
any idea how come removing what's supposed to be
an optimization speeds things up?

> ---
>  drivers/vhost/vhost.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 55e5aa662ad5..f179b5ee14c4 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2174,16 +2174,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>  
>  	start = vq->last_used_idx & (vq->num - 1);
>  	used = vq->used->ring + start;
> -	if (count == 1) {
> -		if (vhost_put_user(vq, heads[0].id, &used->id)) {
> -			vq_err(vq, "Failed to write used id");
> -			return -EFAULT;
> -		}
> -		if (vhost_put_user(vq, heads[0].len, &used->len)) {
> -			vq_err(vq, "Failed to write used len");
> -			return -EFAULT;
> -		}
> -	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> +	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
>  		vq_err(vq, "Failed to write used");
>  		return -EFAULT;
>  	}
> -- 
> 2.17.1


* Re: [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address
  2018-12-29 12:46 ` [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address Jason Wang
@ 2019-01-04 21:34   ` Michael S. Tsirkin
  2019-01-07  8:40     ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-04 21:34 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem

On Sat, Dec 29, 2018 at 08:46:56PM +0800, Jason Wang wrote:
> It was noticed that the copy_user() friends that was used to access
> virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software checks,
> speculation barrier, hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata accessing become significant..
> 
> This patch tries to eliminate those overhead by accessing them through
> kernel virtual address by vmap(). To make the pages can be migrated,
> instead of pinning them through GUP, we use mmu notifiers to
> invalidate vmaps and re-establish vmaps during each round of metadata
> prefetching in necessary. For devices that doesn't use metadata
> prefetching, the memory acessors fallback to normal copy_user()
> implementation gracefully. The invalidation was synchronized with
> datapath through vq mutex, and in order to avoid hold vq mutex during
> range checking, MMU notifier was teared down when trying to modify vq
> metadata.
> 
> Note that this was only done when device IOTLB is not enabled. We
> could use similar method to optimize it in the future.
> 
> Tests shows about ~24% improvement on TX PPS when using virtio-user +
> vhost_net + xdp1 on TAP:
> 
> Before: ~5.0Mpps
> After:  ~6.1Mpps
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 263 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.h |  13 +++
>  2 files changed, 274 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 54b43feef8d9..e1ecb8acf8a3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>  		mutex_init(&vq->mutex);
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
> @@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
>  	return sizeof(*vq->desc) * num;
>  }
>  
> +static void vhost_uninit_vmap(struct vhost_vmap *map)
> +{
> +	if (map->addr)
> +		vunmap(map->unmap_addr);
> +
> +	map->addr = NULL;
> +	map->unmap_addr = NULL;
> +}
> +
> +static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
> +				 struct vhost_vmap *map,
> +				 unsigned long ustart,
> +				 size_t size,
> +				 unsigned long start,
> +				 unsigned long end,
> +				 bool blockable)
> +{
> +	if (end < ustart || start > ustart - 1 + size)
> +		return 0;
> +
> +	if (!blockable)
> +		return -EAGAIN;
> +
> +	mutex_lock(&vq->mutex);
> +	vhost_uninit_vmap(map);
> +	mutex_unlock(&vq->mutex);
> +
> +	return 0;
> +}
> +
> +static int vhost_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +						     struct mm_struct *mm,
> +						     unsigned long start,
> +						     unsigned long end,
> +						     bool blockable)
> +{
> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +					     mmu_notifier);
> +	int i;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		struct vhost_virtqueue *vq = dev->vqs[i];
> +
> +		if (vhost_invalidate_vmap(vq, &vq->avail_ring,
> +					  (unsigned long)vq->avail,
> +					  vhost_get_avail_size(vq, vq->num),
> +					  start, end, blockable))
> +			return -EAGAIN;
> +		if (vhost_invalidate_vmap(vq, &vq->desc_ring,
> +					  (unsigned long)vq->desc,
> +					  vhost_get_desc_size(vq, vq->num),
> +					  start, end, blockable))
> +			return -EAGAIN;
> +		if (vhost_invalidate_vmap(vq, &vq->used_ring,
> +					  (unsigned long)vq->used,
> +					  vhost_get_used_size(vq, vq->num),
> +					  start, end, blockable))
> +			return -EAGAIN;
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> +	.invalidate_range_start = vhost_mmu_notifier_invalidate_range_start,
> +};
> +
>  /* Caller should have device mutex */
>  long vhost_dev_set_owner(struct vhost_dev *dev)
>  {
> @@ -541,7 +611,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>  	if (err)
>  		goto err_cgroup;
>  
> +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> +	if (err)
> +		goto err_mmu_notifier;
> +
>  	return 0;
> +err_mmu_notifier:
> +	vhost_dev_free_iovecs(dev);
>  err_cgroup:
>  	kthread_stop(worker);
>  	dev->worker = NULL;
> @@ -632,6 +709,72 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>  	spin_unlock(&dev->iotlb_lock);
>  }
>  
> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> +			   size_t size, int write)
> +{
> +	struct page **pages;
> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> +	int npinned;
> +	void *vaddr;
> +	int err = 0;
> +
> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> +	if (npinned != npages) {
> +		err = -EFAULT;
> +		goto err;
> +	}
> +
> +	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> +	if (!vaddr) {
> +		err = EFAULT;
> +		goto err;
> +	}
> +
> +	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
> +	map->unmap_addr = vaddr;
> +
> +err:
> +	/* Don't pin pages, mmu notifier will notify us about page
> +	 * migration.
> +	 */
> +	if (npinned > 0)
> +		release_pages(pages, npinned);
> +	kfree(pages);
> +	return err;
> +}
> +
> +static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
> +{
> +	vhost_uninit_vmap(&vq->avail_ring);
> +	vhost_uninit_vmap(&vq->desc_ring);
> +	vhost_uninit_vmap(&vq->used_ring);
> +}
> +
> +static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
> +				  unsigned long avail)
> +{
> +	return vhost_init_vmap(&vq->avail_ring, avail,
> +			       vhost_get_avail_size(vq, vq->num), false);
> +}
> +
> +static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
> +				 unsigned long desc)
> +{
> +	return vhost_init_vmap(&vq->desc_ring, desc,
> +			       vhost_get_desc_size(vq, vq->num), false);
> +}
> +
> +static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
> +				 unsigned long used)
> +{
> +	return vhost_init_vmap(&vq->used_ring, used,
> +			       vhost_get_used_size(vq, vq->num), true);
> +}
> +
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> @@ -661,8 +804,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  		kthread_stop(dev->worker);
>  		dev->worker = NULL;
>  	}
> -	if (dev->mm)
> +	if (dev->mm) {
> +		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
>  		mmput(dev->mm);
> +	}
> +	for (i = 0; i < dev->nvqs; i++)
> +		vhost_clean_vmaps(dev->vqs[i]);
>  	dev->mm = NULL;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -891,6 +1038,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  {
> +	if (!vq->iotlb) {

Do we have to limit this to !iotlb?

> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			*((__virtio16 *)&used->ring[vq->num]) =
> +				cpu_to_vhost16(vq, vq->avail_idx);

So here we are modifying userspace memory without marking it dirty.
Is this OK? And why?



> +			return 0;
> +		}
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>  			      vhost_avail_event(vq));
>  }
> @@ -899,6 +1056,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  				 struct vring_used_elem *head, int idx,
>  				 int count)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			memcpy(used->ring + idx, head,
> +			       count * sizeof(*head));
> +			return 0;

Same here.

> +		}
> +	}
> +
>  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>  				  count * sizeof(*head));
>  }
> @@ -906,6 +1073,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>  			      &vq->used->flags);
>  }
> @@ -913,6 +1089,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>  			      &vq->used->idx);
>  }
> @@ -958,12 +1143,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  				      __virtio16 *idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*idx = avail->idx;
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  				       __virtio16 *head, int idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*head = avail->ring[idx & (vq->num - 1)];
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *head,
>  			       &vq->avail->ring[idx & (vq->num - 1)]);
>  }
> @@ -971,24 +1174,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  					__virtio16 *flags)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*flags = avail->flags;
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>  }
>  
>  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  				       __virtio16 *event)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*event = (__virtio16)avail->ring[vq->num];
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>  }
>  
>  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  				     __virtio16 *idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			*idx = used->idx;
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_used(vq, *idx, &vq->used->idx);
>  }
>  
>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  				 struct vring_desc *desc, int idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_desc *d = vq->desc_ring.addr;
> +
> +		if (likely(d)) {
> +			*desc = *(d + idx);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>  }
>  
> @@ -1325,8 +1564,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
>  {
>  	unsigned int num = vq->num;
>  
> -	if (!vq->iotlb)
> +	if (!vq->iotlb) {
> +		if (unlikely(!vq->avail_ring.addr))
> +			vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
> +		if (unlikely(!vq->desc_ring.addr))
> +			vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
> +		if (unlikely(!vq->used_ring.addr))
> +			vhost_setup_used_vmap(vq, (unsigned long)vq->used);
> +
>  		return 1;
> +	}
>  
>  	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
>  			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
> @@ -1478,6 +1725,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  
>  	mutex_lock(&vq->mutex);
>  
> +	/* Unregister MMU notifer to allow invalidation callback
> +	 * can access vq->avail, vq->desc , vq->used and vq->num
> +	 * without holding vq->mutex.
> +	 */
> +	if (d->mm)
> +		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
> +
>  	switch (ioctl) {
>  	case VHOST_SET_VRING_NUM:
>  		/* Resizing ring with an active backend?
> @@ -1494,6 +1748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  			r = -EINVAL;
>  			break;
>  		}
> +		vhost_clean_vmaps(vq);
>  		vq->num = s.num;
>  		break;
>  	case VHOST_SET_VRING_BASE:
> @@ -1571,6 +1826,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  			}
>  		}
>  
> +		vhost_clean_vmaps(vq);
> +
>  		vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
>  		vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
>  		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
> @@ -1651,6 +1908,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  	if (pollstart && vq->handle_kick)
>  		r = vhost_poll_start(&vq->poll, vq->kick);
>  
> +	if (d->mm)
> +		mmu_notifier_register(&d->mmu_notifier, d->mm);
>  	mutex_unlock(&vq->mutex);
>  
>  	if (pollstop && vq->handle_kick)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 0d1ff977a43e..00f016a4f198 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -12,6 +12,8 @@
>  #include <linux/virtio_config.h>
>  #include <linux/virtio_ring.h>
>  #include <linux/atomic.h>
> +#include <linux/pagemap.h>
> +#include <linux/mmu_notifier.h>
>  
>  struct vhost_work;
>  typedef void (*vhost_work_fn_t)(struct vhost_work *work);
> @@ -80,6 +82,11 @@ enum vhost_uaddr_type {
>  	VHOST_NUM_ADDRS = 3,
>  };
>  
> +struct vhost_vmap {
> +	void *addr;
> +	void *unmap_addr;
> +};
> +

How about using actual types like struct vring_used etc so we get type
safety and do not need to cast on access?


>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -90,6 +97,11 @@ struct vhost_virtqueue {
>  	struct vring_desc __user *desc;
>  	struct vring_avail __user *avail;
>  	struct vring_used __user *used;
> +
> +	struct vhost_vmap avail_ring;
> +	struct vhost_vmap desc_ring;
> +	struct vhost_vmap used_ring;
> +
>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>  	struct file *kick;
>  	struct eventfd_ctx *call_ctx;
> @@ -158,6 +170,7 @@ struct vhost_msg_node {
>  
>  struct vhost_dev {
>  	struct mm_struct *mm;
> +	struct mmu_notifier mmu_notifier;
>  	struct mutex mutex;
>  	struct vhost_virtqueue **vqs;
>  	int nvqs;
> -- 
> 2.17.1


* Re: [RFC PATCH V3 0/5] Hi:
  2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
                   ` (5 preceding siblings ...)
  2019-01-02 20:47 ` [RFC PATCH V3 0/5] Hi: Michael S. Tsirkin
@ 2019-01-04 21:41 ` Michael S. Tsirkin
  2019-01-07  6:58   ` Jason Wang
  6 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-04 21:41 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem

On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.


I think it's a reasonable approach.
However I need to look at whether and which mmu notifiers are invoked before
writeback. Do you know?

> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
> 
> Changes from V2:
> - fix buggy range overlapping check
> - tear down MMU notifier during vhost ioctl to make sure invalidation
>   request can read metadata userspace address and vq size without
>   holding vq mutex.
> Changes from V1:
> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>   remap duing metadata prefetch
> - fix build warning on MIPS
> 
> Jason Wang (5):
>   vhost: generalize adding used elem
>   vhost: fine grain userspace memory accessors
>   vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>   vhost: introduce helpers to get the size of metadata area
>   vhost: access vq metadata through kernel virtual address
> 
>  drivers/vhost/net.c   |   4 +-
>  drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>  drivers/vhost/vhost.h |  15 +-
>  3 files changed, 384 insertions(+), 51 deletions(-)
> 
> -- 
> 2.17.1


* Re: [RFC PATCH V3 1/5] vhost: generalize adding used elem
  2019-01-04 21:29   ` Michael S. Tsirkin
@ 2019-01-05  0:33     ` Sean Christopherson
  2019-01-07  7:00       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2019-01-05  0:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, davem

On Fri, Jan 04, 2019 at 04:29:34PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 29, 2018 at 08:46:52PM +0800, Jason Wang wrote:
> > Use one generic vhost_copy_to_user() instead of two dedicated
> > accessor. This will simplify the conversion to fine grain
> > accessors. About 2% improvement of PPS were seen during vitio-user
> > txonly test.
> > 
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> I don't hve a problem with this patch but do you have
> any idea how come removing what's supposed to be
> an optimization speeds things up?

With SMAP, the 2x vhost_put_user() will also mean an extra STAC/CLAC pair,
which is probably slower than the overhead of CALL+RET to whatever flavor
of copy_user_generic() gets used.  CALL+RET is really the only overhead
since all variants of copy_user_generic() unroll accesses smaller than
64 bytes, e.g. on a 64-bit system, __copy_to_user() will write all 8
bytes in a single MOV.

Removing the special casing also eliminates a few hundred bytes of code
as well as the need for hardware to predict count==1 vs. count>1.

> 
> > ---
> >  drivers/vhost/vhost.c | 11 +----------
> >  1 file changed, 1 insertion(+), 10 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 55e5aa662ad5..f179b5ee14c4 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2174,16 +2174,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
> >  
> >  	start = vq->last_used_idx & (vq->num - 1);
> >  	used = vq->used->ring + start;
> > -	if (count == 1) {
> > -		if (vhost_put_user(vq, heads[0].id, &used->id)) {
> > -			vq_err(vq, "Failed to write used id");
> > -			return -EFAULT;
> > -		}
> > -		if (vhost_put_user(vq, heads[0].len, &used->len)) {
> > -			vq_err(vq, "Failed to write used len");
> > -			return -EFAULT;
> > -		}
> > -	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> > +	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> >  		vq_err(vq, "Failed to write used");
> >  		return -EFAULT;
> >  	}
> > -- 
> > 2.17.1


* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-02 20:47 ` [RFC PATCH V3 0/5] Hi: Michael S. Tsirkin
@ 2019-01-07  2:19   ` Jason Wang
  2019-01-07  3:28     ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2019-01-07  2:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
> Will review, thanks!
> One questions that comes to mind is whether it's all about bypassing
> stac/clac.  Could you please include a performance comparison with
> nosmap?
>

On machine without SMAP (Sandy Bridge):

Before: 4.8Mpps

After: 5.2Mpps

On machine with SMAP (Broadwell):

Before: 5.0Mpps

After: 6.1Mpps

No smap: 7.5Mpps


Thanks



* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  2:19   ` Jason Wang
@ 2019-01-07  3:28     ` Michael S. Tsirkin
  2019-01-07  3:53       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07  3:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem

On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> 
> On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > Will review, thanks!
> > One questions that comes to mind is whether it's all about bypassing
> > stac/clac.  Could you please include a performance comparison with
> > nosmap?
> > 
> 
> On machine without SMAP (Sandy Bridge):
> 
> Before: 4.8Mpps
> 
> After: 5.2Mpps

OK so would you say it's really unsafe versus safe accesses?
Or would you say it's just a better written code?

> On machine with SMAP (Broadwell):
> 
> Before: 5.0Mpps
> 
> After: 6.1Mpps
> 
> No smap: 7.5Mpps
> 
> 
> Thanks


no smap being before or after?

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  3:28     ` Michael S. Tsirkin
@ 2019-01-07  3:53       ` Jason Wang
  2019-01-07  4:17         ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2019-01-07  3:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>> On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends since they had too much
>>>> overheads like checks, spec barriers or even hardware feature
>>>> toggling.
>>> Will review, thanks!
>>> One questions that comes to mind is whether it's all about bypassing
>>> stac/clac.  Could you please include a performance comparison with
>>> nosmap?
>>>
>> On machine without SMAP (Sandy Bridge):
>>
>> Before: 4.8Mpps
>>
>> After: 5.2Mpps
> OK so would you say it's really unsafe versus safe accesses?
> Or would you say it's just a better written code?


It's the effect of removing the speculation barrier.


>
>> On machine with SMAP (Broadwell):
>>
>> Before: 5.0Mpps
>>
>> After: 6.1Mpps
>>
>> No smap: 7.5Mpps
>>
>>
>> Thanks
>
> no smap being before or after?
>

Let me clarify:


Before (SMAP on): 5.0Mpps

Before (SMAP off): 7.5Mpps

After (SMAP on): 6.1Mpps


Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  3:53       ` Jason Wang
@ 2019-01-07  4:17         ` Michael S. Tsirkin
  2019-01-07  6:50           ` Jason Wang
  2019-01-07  7:15           ` Dan Williams
  0 siblings, 2 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07  4:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem, Dan Williams

On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> 
> On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > address instead of copy_user() friends since they had too much
> > > > > overheads like checks, spec barriers or even hardware feature
> > > > > toggling.
> > > > Will review, thanks!
> > > > One questions that comes to mind is whether it's all about bypassing
> > > > stac/clac.  Could you please include a performance comparison with
> > > > nosmap?
> > > > 
> > > On machine without SMAP (Sandy Bridge):
> > > 
> > > Before: 4.8Mpps
> > > 
> > > After: 5.2Mpps
> > OK so would you say it's really unsafe versus safe accesses?
> > Or would you say it's just a better written code?
> 
> 
> It's the effect of removing speculation barrier.


You mean __uaccess_begin_nospec introduced by
commit 304ec1b050310548db33063e567123fae8fd0301
?

So fundamentally we do access_ok checks when supplying
the memory table to the kernel thread, and we should
do the spec barrier there.

Then we can just create and use a variant of uaccess macros that does
not include the barrier?

Or, how about moving the barrier into access_ok?
This way repeated accesses with a single access_ok get a bit faster.
CC Dan Williams on this idea.
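
To make the first option concrete, a minimal sketch (hypothetical field
names, pre-5.0 access_ok() signature, not actual vhost code):

	/* at VHOST_SET_MEM_TABLE time: validate once, fence once */
	if (!access_ok(VERIFY_WRITE, (void __user *)m->uaddr, m->size))
		return -EFAULT;
	barrier_nospec();	/* single barrier covering the whole table */

	/* the datapath could then use uaccess variants without the
	 * per-access barrier, e.g. the unsafe_get_user()/unsafe_put_user()
	 * family between user_access_begin()/user_access_end() */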



> 
> > 
> > > On machine with SMAP (Broadwell):
> > > 
> > > Before: 5.0Mpps
> > > 
> > > After: 6.1Mpps
> > > 
> > > No smap: 7.5Mpps
> > > 
> > > 
> > > Thanks
> > 
> > no smap being before or after?
> > 
> 
> Let me clarify:
> 
> 
> Before (SMAP on): 5.0Mpps
> 
> Before (SMAP off): 7.5Mpps
> 
> After (SMAP on): 6.1Mpps
> 
> 
> Thanks

How about after + smap off?

And maybe we want a module option just for the vhost thread to keep SMAP
off generally, since almost all it does is copy stuff from userspace into
the kernel anyway. Because what the above numbers show is that we really
want a solution that isn't limited to just metadata access, and I really
do not see how any such solution could not also be used to make metadata
access fast.

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  4:17         ` Michael S. Tsirkin
@ 2019-01-07  6:50           ` Jason Wang
  2019-01-07 14:37             ` Michael S. Tsirkin
  2019-01-07  7:15           ` Dan Williams
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Wang @ 2019-01-07  6:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, davem, Dan Williams


On 2019/1/7 下午12:17, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
>> On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
>>> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>>>> On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
>>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>> address instead of copy_user() friends since they had too much
>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>> toggling.
>>>>> Will review, thanks!
>>>>> One questions that comes to mind is whether it's all about bypassing
>>>>> stac/clac.  Could you please include a performance comparison with
>>>>> nosmap?
>>>>>
>>>> On machine without SMAP (Sandy Bridge):
>>>>
>>>> Before: 4.8Mpps
>>>>
>>>> After: 5.2Mpps
>>> OK so would you say it's really unsafe versus safe accesses?
>>> Or would you say it's just a better written code?
>>
>> It's the effect of removing speculation barrier.
>
> You mean __uaccess_begin_nospec introduced by
> commit 304ec1b050310548db33063e567123fae8fd0301
> ?

Yes.


>
> So fundamentally we do access_ok checks when supplying
> the memory table to the kernel thread, and we should
> do the spec barrier there.
>
> Then we can just create and use a variant of uaccess macros that does
> not include the barrier?


The unsafe ones?


>
> Or, how about moving the barrier into access_ok?
> This way repeated accesses with a single access_ok get a bit faster.
> CC Dan Williams on this idea.


The problem is, e.g., the vhost control path: during mem table validation
we don't even want to access that memory there, so the spec barrier is not needed.


>
>
>>>> On machine with SMAP (Broadwell):
>>>>
>>>> Before: 5.0Mpps
>>>>
>>>> After: 6.1Mpps
>>>>
>>>> No smap: 7.5Mpps
>>>>
>>>>
>>>> Thanks
>>> no smap being before or after?
>>>
>> Let me clarify:
>>
>>
>> Before (SMAP on): 5.0Mpps
>>
>> Before (SMAP off): 7.5Mpps
>>
>> After (SMAP on): 6.1Mpps
>>
>>
>> Thanks
> How about after + smap off?


After (SMAP off): 8.0Mpps


>
> And maybe we want a module option just for the vhost thread to keep smap
> off generally since almost all it does is copy stuff from userspace into
> kernel anyway. Because what above numbers should is that we really
> really want a solution that isn't limited to just meta-data access,
> and I really do not see how any such solution can not also be
> used to make meta-data access fast.


As we've discussed in another thread on the previous version, this requires 
lots of changes. The main issue is that SMAP state is not saved/restored 
across an explicit schedule(). Even if it were, since vhost calls into lots 
of net/block code, any uaccess in that code would need to understand this 
special request from vhost, e.g. you would probably need to invent a new 
kind of iov iterator that does not touch SMAP at all. And I'm not sure 
this is the only thing we need to deal with.

So I still prefer to:

1) speed up the metadata access through vmap + MMU notifier

2) speed up the data copy with batched copies (unsafe ones or other new 
interfaces)

Thanks



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-04 21:41 ` Michael S. Tsirkin
@ 2019-01-07  6:58   ` Jason Wang
  2019-01-07 14:47     ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2019-01-07  6:58 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/5 上午5:41, Michael S. Tsirkin wrote:
> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
>
> I think it's a reasonable approach.
> However I need to look at whether and which mmu notifiers are invoked before
> writeback. Do you know?


I don't know but just looking at the MMU notifier ops definition, 
there's no such callback if my understanding is correct.

Thanks


>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Changes from V2:
>> - fix buggy range overlapping check
>> - tear down MMU notifier during vhost ioctl to make sure invalidation
>>    request can read metadata userspace address and vq size without
>>    holding vq mutex.
>> Changes from V1:
>> - instead of pinning pages, use MMU notifier to invalidate vmaps and
>>    remap duing metadata prefetch
>> - fix build warning on MIPS
>>
>> Jason Wang (5):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
>>    vhost: introduce helpers to get the size of metadata area
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/net.c   |   4 +-
>>   drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
>>   drivers/vhost/vhost.h |  15 +-
>>   3 files changed, 384 insertions(+), 51 deletions(-)
>>
>> -- 
>> 2.17.1

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 1/5] vhost: generalize adding used elem
  2019-01-05  0:33     ` Sean Christopherson
@ 2019-01-07  7:00       ` Jason Wang
  2019-01-07 14:50         ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2019-01-07  7:00 UTC (permalink / raw)
  To: Sean Christopherson, Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/5 上午8:33, Sean Christopherson wrote:
> On Fri, Jan 04, 2019 at 04:29:34PM -0500, Michael S. Tsirkin wrote:
>> On Sat, Dec 29, 2018 at 08:46:52PM +0800, Jason Wang wrote:
>>> Use one generic vhost_copy_to_user() instead of two dedicated
>>> accessor. This will simplify the conversion to fine grain
>>> accessors. About 2% improvement of PPS were seen during vitio-user
>>> txonly test.
>>>
>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> I don't hve a problem with this patch but do you have
>> any idea how come removing what's supposed to be
>> an optimization speeds things up?
> With SMAP, the 2x vhost_put_user() will also mean an extra STAC/CLAC pair,
> which is probably slower than the overhead of CALL+RET to whatever flavor
> of copy_user_generic() gets used.  CALL+RET is really the only overhead
> since all variants of copy_user_generic() unroll accesses smaller than
> 64 bytes, e.g. on a 64-bit system, __copy_to_user() will write all 8
> bytes in a single MOV.
>
> Removing the special casing also eliminates a few hundred bytes of code
> as well as the need for hardware to predict count==1 vs. count>1.
>

Yes. I didn't measure, but judging from the nosmap PPS results, STAC/CLAC 
is pretty expensive when we do very small copies.

Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  4:17         ` Michael S. Tsirkin
  2019-01-07  6:50           ` Jason Wang
@ 2019-01-07  7:15           ` Dan Williams
  2019-01-07 14:11             ` Michael S. Tsirkin
  1 sibling, 1 reply; 33+ messages in thread
From: Dan Williams @ 2019-01-07  7:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller

On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> >
> > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > address instead of copy_user() friends since they had too much
> > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > toggling.
> > > > > Will review, thanks!
> > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > stac/clac.  Could you please include a performance comparison with
> > > > > nosmap?
> > > > >
> > > > On machine without SMAP (Sandy Bridge):
> > > >
> > > > Before: 4.8Mpps
> > > >
> > > > After: 5.2Mpps
> > > OK so would you say it's really unsafe versus safe accesses?
> > > Or would you say it's just a better written code?
> >
> >
> > It's the effect of removing speculation barrier.
>
>
> You mean __uaccess_begin_nospec introduced by
> commit 304ec1b050310548db33063e567123fae8fd0301
> ?
>
> So fundamentally we do access_ok checks when supplying
> the memory table to the kernel thread, and we should
> do the spec barrier there.
>
> Then we can just create and use a variant of uaccess macros that does
> not include the barrier?
>
> Or, how about moving the barrier into access_ok?
> This way repeated accesses with a single access_ok get a bit faster.
> CC Dan Williams on this idea.

It would be interesting to see how expensive re-doing the address
limit check is compared to the speculation barrier. I.e. just switch
vhost_get_user() to use get_user() rather than __get_user(). That will
sanitize the pointer in the speculative path without a barrier.
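
To illustrate the two patterns (sketch only, not the real vhost accessors):

	/* barrier-based: check once, then __get_user(), which issues an
	 * __uaccess_begin_nospec() (lfence) around each access */
	if (!access_ok(VERIFY_READ, p, sizeof(*p)))
		return -EFAULT;
	r = __get_user(x, p);

	/* pointer-sanitizing: get_user() redoes the limit check and, on
	 * x86, clamps the pointer (cmp/sbb) so no lfence is needed */
	r = get_user(x, p);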

I recall we had a convert access_ok() discussion with this result here:

https://lkml.org/lkml/2018/1/17/929

...but it sounds like you are proposing a smaller scope fixup for the
vhost use case? Something like barrier_nospec() in the success path
for all vhost access_ok() checks and then a get_user() variant that
disables the barrier.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address
  2019-01-04 21:34   ` Michael S. Tsirkin
@ 2019-01-07  8:40     ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2019-01-07  8:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/5 上午5:34, Michael S. Tsirkin wrote:
> On Sat, Dec 29, 2018 at 08:46:56PM +0800, Jason Wang wrote:
>> It was noticed that the copy_user() friends that was used to access
>> virtqueue metdata tends to be very expensive for dataplane
>> implementation like vhost since it involves lots of software checks,
>> speculation barrier, hardware feature toggling (e.g SMAP). The
>> extra cost will be more obvious when transferring small packets since
>> the time spent on metadata accessing become significant..
>>
>> This patch tries to eliminate those overhead by accessing them through
>> kernel virtual address by vmap(). To make the pages can be migrated,
>> instead of pinning them through GUP, we use mmu notifiers to
>> invalidate vmaps and re-establish vmaps during each round of metadata
>> prefetching in necessary. For devices that doesn't use metadata
>> prefetching, the memory acessors fallback to normal copy_user()
>> implementation gracefully. The invalidation was synchronized with
>> datapath through vq mutex, and in order to avoid hold vq mutex during
>> range checking, MMU notifier was teared down when trying to modify vq
>> metadata.
>>
>> Note that this was only done when device IOTLB is not enabled. We
>> could use similar method to optimize it in the future.
>>
>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>> vhost_net + xdp1 on TAP:
>>
>> Before: ~5.0Mpps
>> After:  ~6.1Mpps
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   drivers/vhost/vhost.c | 263 +++++++++++++++++++++++++++++++++++++++++-
>>   drivers/vhost/vhost.h |  13 +++
>>   2 files changed, 274 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 54b43feef8d9..e1ecb8acf8a3 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>   		vq->indirect = NULL;
>>   		vq->heads = NULL;
>>   		vq->dev = dev;
>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>   		mutex_init(&vq->mutex);
>>   		vhost_vq_reset(dev, vq);
>>   		if (vq->handle_kick)
>> @@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
>>   	return sizeof(*vq->desc) * num;
>>   }
>>   
>> +static void vhost_uninit_vmap(struct vhost_vmap *map)
>> +{
>> +	if (map->addr)
>> +		vunmap(map->unmap_addr);
>> +
>> +	map->addr = NULL;
>> +	map->unmap_addr = NULL;
>> +}
>> +
>> +static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
>> +				 struct vhost_vmap *map,
>> +				 unsigned long ustart,
>> +				 size_t size,
>> +				 unsigned long start,
>> +				 unsigned long end,
>> +				 bool blockable)
>> +{
>> +	if (end < ustart || start > ustart - 1 + size)
>> +		return 0;
>> +
>> +	if (!blockable)
>> +		return -EAGAIN;
>> +
>> +	mutex_lock(&vq->mutex);
>> +	vhost_uninit_vmap(map);
>> +	mutex_unlock(&vq->mutex);
>> +
>> +	return 0;
>> +}
>> +
>> +static int vhost_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> +						     struct mm_struct *mm,
>> +						     unsigned long start,
>> +						     unsigned long end,
>> +						     bool blockable)
>> +{
>> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> +					     mmu_notifier);
>> +	int i;
>> +
>> +	for (i = 0; i < dev->nvqs; i++) {
>> +		struct vhost_virtqueue *vq = dev->vqs[i];
>> +
>> +		if (vhost_invalidate_vmap(vq, &vq->avail_ring,
>> +					  (unsigned long)vq->avail,
>> +					  vhost_get_avail_size(vq, vq->num),
>> +					  start, end, blockable))
>> +			return -EAGAIN;
>> +		if (vhost_invalidate_vmap(vq, &vq->desc_ring,
>> +					  (unsigned long)vq->desc,
>> +					  vhost_get_desc_size(vq, vq->num),
>> +					  start, end, blockable))
>> +			return -EAGAIN;
>> +		if (vhost_invalidate_vmap(vq, &vq->used_ring,
>> +					  (unsigned long)vq->used,
>> +					  vhost_get_used_size(vq, vq->num),
>> +					  start, end, blockable))
>> +			return -EAGAIN;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
>> +	.invalidate_range_start = vhost_mmu_notifier_invalidate_range_start,
>> +};
>> +
>>   /* Caller should have device mutex */
>>   long vhost_dev_set_owner(struct vhost_dev *dev)
>>   {
>> @@ -541,7 +611,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>>   	if (err)
>>   		goto err_cgroup;
>>   
>> +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
>> +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
>> +	if (err)
>> +		goto err_mmu_notifier;
>> +
>>   	return 0;
>> +err_mmu_notifier:
>> +	vhost_dev_free_iovecs(dev);
>>   err_cgroup:
>>   	kthread_stop(worker);
>>   	dev->worker = NULL;
>> @@ -632,6 +709,72 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>   	spin_unlock(&dev->iotlb_lock);
>>   }
>>   
>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>> +			   size_t size, int write)
>> +{
>> +	struct page **pages;
>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>> +	int npinned;
>> +	void *vaddr;
>> +	int err = 0;
>> +
>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>> +	if (npinned != npages) {
>> +		err = -EFAULT;
>> +		goto err;
>> +	}
>> +
>> +	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
>> +	if (!vaddr) {
>> +		err = EFAULT;
>> +		goto err;
>> +	}
>> +
>> +	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
>> +	map->unmap_addr = vaddr;
>> +
>> +err:
>> +	/* Don't pin pages, mmu notifier will notify us about page
>> +	 * migration.
>> +	 */
>> +	if (npinned > 0)
>> +		release_pages(pages, npinned);
>> +	kfree(pages);
>> +	return err;
>> +}
>> +
>> +static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
>> +{
>> +	vhost_uninit_vmap(&vq->avail_ring);
>> +	vhost_uninit_vmap(&vq->desc_ring);
>> +	vhost_uninit_vmap(&vq->used_ring);
>> +}
>> +
>> +static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
>> +				  unsigned long avail)
>> +{
>> +	return vhost_init_vmap(&vq->avail_ring, avail,
>> +			       vhost_get_avail_size(vq, vq->num), false);
>> +}
>> +
>> +static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
>> +				 unsigned long desc)
>> +{
>> +	return vhost_init_vmap(&vq->desc_ring, desc,
>> +			       vhost_get_desc_size(vq, vq->num), false);
>> +}
>> +
>> +static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
>> +				 unsigned long used)
>> +{
>> +	return vhost_init_vmap(&vq->used_ring, used,
>> +			       vhost_get_used_size(vq, vq->num), true);
>> +}
>> +
>>   void vhost_dev_cleanup(struct vhost_dev *dev)
>>   {
>>   	int i;
>> @@ -661,8 +804,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>>   		kthread_stop(dev->worker);
>>   		dev->worker = NULL;
>>   	}
>> -	if (dev->mm)
>> +	if (dev->mm) {
>> +		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
>>   		mmput(dev->mm);
>> +	}
>> +	for (i = 0; i < dev->nvqs; i++)
>> +		vhost_clean_vmaps(dev->vqs[i]);
>>   	dev->mm = NULL;
>>   }
>>   EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>> @@ -891,6 +1038,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>>   
>>   static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>>   {
>> +	if (!vq->iotlb) {
> Do we have to limit this to !iotlb?


No need, but for simplicity I leave this for the future.


>
>> +		struct vring_used *used = vq->used_ring.addr;
>> +
>> +		if (likely(used)) {
>> +			*((__virtio16 *)&used->ring[vq->num]) =
>> +				cpu_to_vhost16(vq, vq->avail_idx);
> So here we are modifying userspace memory without marking it dirty.
> Is this OK? And why?


Probably not. Any suggestion on how to fix this?


>
>
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>>   			      vhost_avail_event(vq));
>>   }
>> @@ -899,6 +1056,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>   				 struct vring_used_elem *head, int idx,
>>   				 int count)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_used *used = vq->used_ring.addr;
>> +
>> +		if (likely(used)) {
>> +			memcpy(used->ring + idx, head,
>> +			       count * sizeof(*head));
>> +			return 0;
> Same here.
>
>> +		}
>> +	}
>> +
>>   	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>>   				  count * sizeof(*head));
>>   }
>> @@ -906,6 +1073,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>   static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>   
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_used *used = vq->used_ring.addr;
>> +
>> +		if (likely(used)) {
>> +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>>   			      &vq->used->flags);
>>   }
>> @@ -913,6 +1089,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>   static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>>   
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_used *used = vq->used_ring.addr;
>> +
>> +		if (likely(used)) {
>> +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>>   			      &vq->used->idx);
>>   }
>> @@ -958,12 +1143,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>>   static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>>   				      __virtio16 *idx)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_avail *avail = vq->avail_ring.addr;
>> +
>> +		if (likely(avail)) {
>> +			*idx = avail->idx;
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>>   }
>>   
>>   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>   				       __virtio16 *head, int idx)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_avail *avail = vq->avail_ring.addr;
>> +
>> +		if (likely(avail)) {
>> +			*head = avail->ring[idx & (vq->num - 1)];
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_get_avail(vq, *head,
>>   			       &vq->avail->ring[idx & (vq->num - 1)]);
>>   }
>> @@ -971,24 +1174,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>>   					__virtio16 *flags)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_avail *avail = vq->avail_ring.addr;
>> +
>> +		if (likely(avail)) {
>> +			*flags = avail->flags;
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>>   }
>>   
>>   static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>>   				       __virtio16 *event)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_avail *avail = vq->avail_ring.addr;
>> +
>> +		if (likely(avail)) {
>> +			*event = (__virtio16)avail->ring[vq->num];
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>>   }
>>   
>>   static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>>   				     __virtio16 *idx)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_used *used = vq->used_ring.addr;
>> +
>> +		if (likely(used)) {
>> +			*idx = used->idx;
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_get_used(vq, *idx, &vq->used->idx);
>>   }
>>   
>>   static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>   				 struct vring_desc *desc, int idx)
>>   {
>> +	if (!vq->iotlb) {
>> +		struct vring_desc *d = vq->desc_ring.addr;
>> +
>> +		if (likely(d)) {
>> +			*desc = *(d + idx);
>> +			return 0;
>> +		}
>> +	}
>> +
>>   	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>>   }
>>   
>> @@ -1325,8 +1564,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
>>   {
>>   	unsigned int num = vq->num;
>>   
>> -	if (!vq->iotlb)
>> +	if (!vq->iotlb) {
>> +		if (unlikely(!vq->avail_ring.addr))
>> +			vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
>> +		if (unlikely(!vq->desc_ring.addr))
>> +			vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
>> +		if (unlikely(!vq->used_ring.addr))
>> +			vhost_setup_used_vmap(vq, (unsigned long)vq->used);
>> +
>>   		return 1;
>> +	}
>>   
>>   	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
>>   			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
>> @@ -1478,6 +1725,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>   
>>   	mutex_lock(&vq->mutex);
>>   
>> +	/* Unregister MMU notifer to allow invalidation callback
>> +	 * can access vq->avail, vq->desc , vq->used and vq->num
>> +	 * without holding vq->mutex.
>> +	 */
>> +	if (d->mm)
>> +		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
>> +
>>   	switch (ioctl) {
>>   	case VHOST_SET_VRING_NUM:
>>   		/* Resizing ring with an active backend?
>> @@ -1494,6 +1748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>   			r = -EINVAL;
>>   			break;
>>   		}
>> +		vhost_clean_vmaps(vq);
>>   		vq->num = s.num;
>>   		break;
>>   	case VHOST_SET_VRING_BASE:
>> @@ -1571,6 +1826,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>   			}
>>   		}
>>   
>> +		vhost_clean_vmaps(vq);
>> +
>>   		vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
>>   		vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
>>   		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
>> @@ -1651,6 +1908,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>   	if (pollstart && vq->handle_kick)
>>   		r = vhost_poll_start(&vq->poll, vq->kick);
>>   
>> +	if (d->mm)
>> +		mmu_notifier_register(&d->mmu_notifier, d->mm);
>>   	mutex_unlock(&vq->mutex);
>>   
>>   	if (pollstop && vq->handle_kick)
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 0d1ff977a43e..00f016a4f198 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -12,6 +12,8 @@
>>   #include <linux/virtio_config.h>
>>   #include <linux/virtio_ring.h>
>>   #include <linux/atomic.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/mmu_notifier.h>
>>   
>>   struct vhost_work;
>>   typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>> @@ -80,6 +82,11 @@ enum vhost_uaddr_type {
>>   	VHOST_NUM_ADDRS = 3,
>>   };
>>   
>> +struct vhost_vmap {
>> +	void *addr;
>> +	void *unmap_addr;
>> +};
>> +
> How about using actual types like struct vring_used etc so we get type
> safety and do not need to cast on access?


Yes, we can.

Thanks


>
>
>>   /* The virtqueue structure describes a queue attached to a device. */
>>   struct vhost_virtqueue {
>>   	struct vhost_dev *dev;
>> @@ -90,6 +97,11 @@ struct vhost_virtqueue {
>>   	struct vring_desc __user *desc;
>>   	struct vring_avail __user *avail;
>>   	struct vring_used __user *used;
>> +
>> +	struct vhost_vmap avail_ring;
>> +	struct vhost_vmap desc_ring;
>> +	struct vhost_vmap used_ring;
>> +
>>   	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>>   	struct file *kick;
>>   	struct eventfd_ctx *call_ctx;
>> @@ -158,6 +170,7 @@ struct vhost_msg_node {
>>   
>>   struct vhost_dev {
>>   	struct mm_struct *mm;
>> +	struct mmu_notifier mmu_notifier;
>>   	struct mutex mutex;
>>   	struct vhost_virtqueue **vqs;
>>   	int nvqs;
>> -- 
>> 2.17.1

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  7:15           ` Dan Williams
@ 2019-01-07 14:11             ` Michael S. Tsirkin
  2019-01-07 21:39               ` Dan Williams
  2019-01-08 11:42               ` [RFC PATCH V3 0/5] Hi: Jason Wang
  0 siblings, 2 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07 14:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller

On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > >
> > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Will review, thanks!
> > > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > nosmap?
> > > > > >
> > > > > On machine without SMAP (Sandy Bridge):
> > > > >
> > > > > Before: 4.8Mpps
> > > > >
> > > > > After: 5.2Mpps
> > > > OK so would you say it's really unsafe versus safe accesses?
> > > > Or would you say it's just a better written code?
> > >
> > >
> > > It's the effect of removing speculation barrier.
> >
> >
> > You mean __uaccess_begin_nospec introduced by
> > commit 304ec1b050310548db33063e567123fae8fd0301
> > ?
> >
> > So fundamentally we do access_ok checks when supplying
> > the memory table to the kernel thread, and we should
> > do the spec barrier there.
> >
> > Then we can just create and use a variant of uaccess macros that does
> > not include the barrier?
> >
> > Or, how about moving the barrier into access_ok?
> > This way repeated accesses with a single access_ok get a bit faster.
> > CC Dan Williams on this idea.
> 
> It would be interesting to see how expensive re-doing the address
> limit check is compared to the speculation barrier. I.e. just switch
> vhost_get_user() to use get_user() rather than __get_user(). That will
> sanitize the pointer in the speculative path without a barrier.

Hmm, it's way cheaper, even though IIRC it's still measurable.
Jason, would you like to try?
Although frankly, __get_user being slower than get_user feels very wrong.
Not yet sure what to do exactly, but would you agree?


> I recall we had a convert access_ok() discussion with this result here:
> 
> https://lkml.org/lkml/2018/1/17/929

Sorry let me try to clarify. IIUC speculating access_ok once
is harmless. As Linus said the problem is with "_subsequent_
accesses that can then be used to perturb the cache".

Thus:

1. if (!access_ok)
2.	return
3.  get_user
4. if (!access_ok)
5.	return
6.  get_user

Your proposal that Linus nacked was to effectively add a barrier after
lines 2 and 5 (also using the array_index_nospec trick for speed),
right? Unfortunately that needs a big API change.

I am asking about adding barrier_nospec within access_ok.
Thus effectively before lines 1 and 4.
access_ok will be slower but after all the point of access_ok is
to then access the same memory multiple times.
So we should be making __get_user faster and access_ok slower ...
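
Sketch of the two placements, keyed to the numbering above (illustration
only):

	/* nacked variant: barrier in the success path, i.e. between
	 * lines 2/3 and 5/6 */
	if (!access_ok(VERIFY_READ, p, len))
		return;
	barrier_nospec();
	__get_user(x, p);

	/* proposal here: barrier_nospec() within access_ok() itself
	 * (effectively before lines 1 and 4), so it is paid once per
	 * access_ok() and __get_user() can drop its own lfence */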

> ...but it sounds like you are proposing a smaller scope fixup for the
> vhost use case? Something like barrier_nospec() in the success path
> for all vhost access_ok() checks and then a get_user() variant that
> disables the barrier.

Maybe we'll have to. Except I hope vhost won't end up being the
only user, otherwise it will be hard to maintain.


-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  6:50           ` Jason Wang
@ 2019-01-07 14:37             ` Michael S. Tsirkin
  2019-01-08 10:01               ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07 14:37 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem, Dan Williams

On Mon, Jan 07, 2019 at 02:50:17PM +0800, Jason Wang wrote:
> 
> On 2019/1/7 下午12:17, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > toggling.
> > > > > > Will review, thanks!
> > > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > nosmap?
> > > > > > 
> > > > > On machine without SMAP (Sandy Bridge):
> > > > > 
> > > > > Before: 4.8Mpps
> > > > > 
> > > > > After: 5.2Mpps
> > > > OK so would you say it's really unsafe versus safe accesses?
> > > > Or would you say it's just a better written code?
> > > 
> > > It's the effect of removing speculation barrier.
> > 
> > You mean __uaccess_begin_nospec introduced by
> > commit 304ec1b050310548db33063e567123fae8fd0301
> > ?
> 
> Yes.
> 
> 
> > 
> > So fundamentally we do access_ok checks when supplying
> > the memory table to the kernel thread, and we should
> > do the spec barrier there.
> > 
> > Then we can just create and use a variant of uaccess macros that does
> > not include the barrier?
> 
> 
> The unsafe ones?

Fundamentally yes.


> 
> > 
> > Or, how about moving the barrier into access_ok?
> > This way repeated accesses with a single access_ok get a bit faster.
> > CC Dan Williams on this idea.
> 
> 
> The problem is, e.g for vhost control path. During mem table validation, we
> don't even want to access them there. So the spec barrier is not needed.

Again, the spec barrier is not needed as such at all; it's defence in depth.
And mem table init is a slow path, so we can stick a barrier there and it
won't be a problem for anyone.

> 
> > 
> > 
> > > > > On machine with SMAP (Broadwell):
> > > > > 
> > > > > Before: 5.0Mpps
> > > > > 
> > > > > After: 6.1Mpps
> > > > > 
> > > > > No smap: 7.5Mpps
> > > > > 
> > > > > 
> > > > > Thanks
> > > > no smap being before or after?
> > > > 
> > > Let me clarify:
> > > 
> > > 
> > > Before (SMAP on): 5.0Mpps
> > > 
> > > Before (SMAP off): 7.5Mpps
> > > 
> > > After (SMAP on): 6.1Mpps
> > > 
> > > 
> > > Thanks
> > How about after + smap off?
> 
> 
> After (SMAP off): 8.0Mpps
> 
> > 
> > And maybe we want a module option just for the vhost thread to keep smap
> > off generally since almost all it does is copy stuff from userspace into
> > kernel anyway. Because what above numbers should is that we really
> > really want a solution that isn't limited to just meta-data access,
> > and I really do not see how any such solution can not also be
> > used to make meta-data access fast.
> 
> 
> As we've discussed in another thread of previous version. This requires lots
> of changes, the main issues is SMAP state was not saved/restored on explicit
> schedule().

I wonder how expensive reading eflags can be.
If it's cheap, we can just check EFLAGS.AC and rerun stac if needed.
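
A minimal sketch of that check (assumes x86 with CONFIG_X86_SMAP and the
native_save_fl()/stac() helpers; hypothetical and untested):

	/* after returning from schedule(), re-assert AC if it was lost */
	if (!(native_save_fl() & X86_EFLAGS_AC))
		stac();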

> Even if it did, since vhost will call lots of net/block codes,
> any kind of uaccess in those codes needs understand this special request
> from vhost e.g you provably need to invent a new kinds of iov iterator that
> does not touch SMAP at all. And I'm not sure this is the only thing we need
> to deal with.


Well, we wanted to move packet processing from tun into vhost anyway, right?

> 
> So I still prefer to:
> 
> 1) speedup the metadata access through vmap + MMU notifier
> 
> 2) speedup the datacopy with batched copy (unsafe ones or other new
> interfaces)
> 
> Thanks

I just guess once you do (2) you will want to rework (1) to use
the new interfaces. So all the effort you are now investing in (1)
will be wasted. Just my $.02.

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07  6:58   ` Jason Wang
@ 2019-01-07 14:47     ` Michael S. Tsirkin
  2019-01-08 10:12       ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07 14:47 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel, davem

On Mon, Jan 07, 2019 at 02:58:08PM +0800, Jason Wang wrote:
> 
> On 2019/1/5 上午5:41, Michael S. Tsirkin wrote:
> > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > 
> > I think it's a reasonable approach.
> > However I need to look at whether and which mmu notifiers are invoked before
> > writeback. Do you know?
> 
> 
> I don't know but just looking at the MMU notifier ops definition, there's no
> such callback if my understanding is correct.
> 
> Thanks

In that case, how are you making sure used ring updates are written back?
If they aren't, the guest will crash ...

> 
> > 
> > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Changes from V2:
> > > - fix buggy range overlapping check
> > > - tear down MMU notifier during vhost ioctl to make sure invalidation
> > >    request can read metadata userspace address and vq size without
> > >    holding vq mutex.
> > > Changes from V1:
> > > - instead of pinning pages, use MMU notifier to invalidate vmaps and
> > >    remap duing metadata prefetch
> > > - fix build warning on MIPS
> > > 
> > > Jason Wang (5):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
> > >    vhost: introduce helpers to get the size of metadata area
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/net.c   |   4 +-
> > >   drivers/vhost/vhost.c | 416 +++++++++++++++++++++++++++++++++++++-----
> > >   drivers/vhost/vhost.h |  15 +-
> > >   3 files changed, 384 insertions(+), 51 deletions(-)
> > > 
> > > -- 
> > > 2.17.1

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 1/5] vhost: generalize adding used elem
  2019-01-07  7:00       ` Jason Wang
@ 2019-01-07 14:50         ` Michael S. Tsirkin
  0 siblings, 0 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07 14:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Sean Christopherson, kvm, virtualization, netdev, linux-kernel, davem

On Mon, Jan 07, 2019 at 03:00:17PM +0800, Jason Wang wrote:
> 
> On 2019/1/5 上午8:33, Sean Christopherson wrote:
> > On Fri, Jan 04, 2019 at 04:29:34PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 29, 2018 at 08:46:52PM +0800, Jason Wang wrote:
> > > > Use one generic vhost_copy_to_user() instead of two dedicated
> > > > accessor. This will simplify the conversion to fine grain
> > > > accessors. About 2% improvement of PPS were seen during vitio-user
> > > > txonly test.
> > > > 
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > I don't hve a problem with this patch but do you have
> > > any idea how come removing what's supposed to be
> > > an optimization speeds things up?
> > With SMAP, the 2x vhost_put_user() will also mean an extra STAC/CLAC pair,
> > which is probably slower than the overhead of CALL+RET to whatever flavor
> > of copy_user_generic() gets used.  CALL+RET is really the only overhead
> > since all variants of copy_user_generic() unroll accesses smaller than
> > 64 bytes, e.g. on a 64-bit system, __copy_to_user() will write all 8
> > bytes in a single MOV.
> > 
> > Removing the special casing also eliminates a few hundred bytes of code
> > as well as the need for hardware to predict count==1 vs. count>1.
> > 
> 
> Yes, I don't measure, but STAC/CALC is pretty expensive when we are do very
> small copies based on the result of nosmap PPS.
> 
> Thanks

Yes, all this really looks like a poster child for uaccess_begin/end
plus unsafe accesses. And if these APIs don't do the job for us,
then maybe better ones are needed ...


-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07 14:11             ` Michael S. Tsirkin
@ 2019-01-07 21:39               ` Dan Williams
  2019-01-07 22:25                 ` Michael S. Tsirkin
  2019-01-08 11:42               ` [RFC PATCH V3 0/5] Hi: Jason Wang
  1 sibling, 1 reply; 33+ messages in thread
From: Dan Williams @ 2019-01-07 21:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller

On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > >
> > > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > toggling.
> > > > > > > Will review, thanks!
> > > > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > nosmap?
> > > > > > >
> > > > > > On machine without SMAP (Sandy Bridge):
> > > > > >
> > > > > > Before: 4.8Mpps
> > > > > >
> > > > > > After: 5.2Mpps
> > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > Or would you say it's just a better written code?
> > > >
> > > >
> > > > It's the effect of removing speculation barrier.
> > >
> > >
> > > You mean __uaccess_begin_nospec introduced by
> > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > ?
> > >
> > > So fundamentally we do access_ok checks when supplying
> > > the memory table to the kernel thread, and we should
> > > do the spec barrier there.
> > >
> > > Then we can just create and use a variant of uaccess macros that does
> > > not include the barrier?
> > >
> > > Or, how about moving the barrier into access_ok?
> > > This way repeated accesses with a single access_ok get a bit faster.
> > > CC Dan Williams on this idea.
> >
> > It would be interesting to see how expensive re-doing the address
> > limit check is compared to the speculation barrier. I.e. just switch
> > vhost_get_user() to use get_user() rather than __get_user(). That will
> > sanitize the pointer in the speculative path without a barrier.
>
> Hmm it's way cheaper even though IIRC it's measureable.
> Jason, would you like to try?
> Although frankly __get_user being slower than get_user feels very wrong.
> Not yet sure what to do exactly but would you agree?

Agree. __get_user() being faster than get_user() defeats the whole
point of converting code paths to the access_ok() + __get_user()
pattern.

>
>
> > I recall we had a convert access_ok() discussion with this result here:
> >
> > https://lkml.org/lkml/2018/1/17/929
>
> Sorry let me try to clarify. IIUC speculating access_ok once
> is harmless. As Linus said the problem is with "_subsequent_
> accesses that can then be used to perturb the cache".
>
> Thus:
>
> 1. if (!access_ok)
> 2.      return
> 3.  get_user
> 4. if (!access_ok)
> 5.      return
> 6.  get_user
>
> Your proposal that Linus nacked was to effectively add a barrier after
> lines 2 and 5 (also using the array_index_nospec trick for speed),
> right? Unfortunately that needs a big API change.
>
> I am asking about adding barrier_nospec within access_ok.
> Thus effectively before lines 1 and 4.
> access_ok will be slower but after all the point of access_ok is
> to then access the same memory multiple times.

If the barrier is before lines 1 and 4 then it offers no real
protection as far I can see. It will start speculating again on the
user controlled pointer value after the barrier resolves.

> So we should be making __get_user faster and access_ok slower ...

I agree, but then the barrier always needs to be after the access_ok()
check unconditionally called in the return path from access_ok(). At
that point it's back to the implementation that Linus nak'd, or I'm
missing some other detail.

...but maybe if it is limited to a new access_ok_nospec() then the
concern is addressed? Then rename current __get_user() to
__get_user_nospec() and make a new __get_user() that is back to being
optimal.
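
Sketch of the split being suggested (hypothetical names, not an existing
API):

	/*
	 * access_ok_nospec(ptr, size) -- access_ok() plus speculation
	 *                                hardening on the success path
	 * __get_user_nospec(x, ptr)   -- what __get_user() is today,
	 *                                lfence included
	 * __get_user(x, ptr)          -- raw access, no barrier; only
	 *                                valid after access_ok_nospec()
	 */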

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07 21:39               ` Dan Williams
@ 2019-01-07 22:25                 ` Michael S. Tsirkin
  2019-01-07 22:44                   ` Dan Williams
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07 22:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller

On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > >
> > > > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > > toggling.
> > > > > > > > Will review, thanks!
> > > > > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > > nosmap?
> > > > > > > >
> > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > >
> > > > > > > Before: 4.8Mpps
> > > > > > >
> > > > > > > After: 5.2Mpps
> > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > Or would you say it's just a better written code?
> > > > >
> > > > >
> > > > > It's the effect of removing speculation barrier.
> > > >
> > > >
> > > > You mean __uaccess_begin_nospec introduced by
> > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > ?
> > > >
> > > > So fundamentally we do access_ok checks when supplying
> > > > the memory table to the kernel thread, and we should
> > > > do the spec barrier there.
> > > >
> > > > Then we can just create and use a variant of uaccess macros that does
> > > > not include the barrier?
> > > >
> > > > Or, how about moving the barrier into access_ok?
> > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > CC Dan Williams on this idea.
> > >
> > > It would be interesting to see how expensive re-doing the address
> > > limit check is compared to the speculation barrier. I.e. just switch
> > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > sanitize the pointer in the speculative path without a barrier.
> >
> > Hmm it's way cheaper even though IIRC it's measureable.
> > Jason, would you like to try?
> > Although frankly __get_user being slower than get_user feels very wrong.
> > Not yet sure what to do exactly but would you agree?
> 
> Agree. __get_user() being faster than get_user() defeats the whole
> point of converting code paths to the access_ok() + __get_user()
> pattern.

Did you mean the reverse?

> >
> >
> > > I recall we had a convert access_ok() discussion with this result here:
> > >
> > > https://lkml.org/lkml/2018/1/17/929
> >
> > Sorry let me try to clarify. IIUC speculating access_ok once
> > is harmless. As Linus said the problem is with "_subsequent_
> > accesses that can then be used to perturb the cache".
> >
> > Thus:
> >
> > 1. if (!access_ok)
> > 2.      return
> > 3.  get_user
> > 4. if (!access_ok)
> > 5.      return
> > 6.  get_user
> >
> > Your proposal that Linus nacked was to effectively add a barrier after
> > lines 2 and 5 (also using the array_index_nospec trick for speed),
> > right? Unfortunately that needs a big API change.
> >
> > I am asking about adding barrier_nospec within access_ok.
> > Thus effectively before lines 1 and 4.
> > access_ok will be slower but after all the point of access_ok is
> > to then access the same memory multiple times.
> 
> If the barrier is before lines 1 and 4 then it offers no real
> protection as far I can see. It will start speculating again on the
> user controlled pointer value after the barrier resolves.
> 
> > So we should be making __get_user faster and access_ok slower ...
> 
> I agree, but then the barrier always needs to be after the access_ok()
> check unconditionally called in the return path from access_ok(). At
> that point it's back to the implementation that Linus nak'd, or I'm
> missing some other detail.
> 
> ...but maybe if it is limited to a new access_ok_nospec() then the
> concern is addressed? Then rename current __get_user() to
> __get_user_nospec() and make a new __get_user() that is back to being
> optimal.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07 22:25                 ` Michael S. Tsirkin
@ 2019-01-07 22:44                   ` Dan Williams
  2019-01-09  4:31                     ` __get_user slower than get_user (was Re: [RFC PATCH V3 0/5] Hi:) Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Dan Williams @ 2019-01-07 22:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller

On Mon, Jan 7, 2019 at 2:25 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> > On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > > >
> > > > > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > > > toggling.
> > > > > > > > > Will review, thanks!
> > > > > > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > > > nosmap?
> > > > > > > > >
> > > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > > >
> > > > > > > > Before: 4.8Mpps
> > > > > > > >
> > > > > > > > After: 5.2Mpps
> > > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > > Or would you say it's just a better written code?
> > > > > >
> > > > > >
> > > > > > It's the effect of removing speculation barrier.
> > > > >
> > > > >
> > > > > You mean __uaccess_begin_nospec introduced by
> > > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > > ?
> > > > >
> > > > > So fundamentally we do access_ok checks when supplying
> > > > > the memory table to the kernel thread, and we should
> > > > > do the spec barrier there.
> > > > >
> > > > > Then we can just create and use a variant of uaccess macros that does
> > > > > not include the barrier?
> > > > >
> > > > > Or, how about moving the barrier into access_ok?
> > > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > > CC Dan Williams on this idea.
> > > >
> > > > It would be interesting to see how expensive re-doing the address
> > > > limit check is compared to the speculation barrier. I.e. just switch
> > > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > > sanitize the pointer in the speculative path without a barrier.
> > >
> > > Hmm it's way cheaper even though IIRC it's measureable.
> > > Jason, would you like to try?
> > > Although frankly __get_user being slower than get_user feels very wrong.
> > > Not yet sure what to do exactly but would you agree?
> >
> > Agree. __get_user() being faster than get_user() defeats the whole
> > point of converting code paths to the access_ok() + __get_user()
> > pattern.
>
> Did you mean the reverse?

Hmm, no... I'll rephrase: __get_user() should have lower overhead than
get_user().
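
For context, a minimal sketch of the two patterns being compared (the
helper names are made up, and the two-argument access_ok() form is
assumed):

#include <linux/uaccess.h>

/* Pattern A: one range check, then repeated "cheap" accesses.  The
 * benefit evaporates if __get_user() also carries a speculation
 * barrier on every call. */
static int read_pair_checked_once(u32 __user *p, u32 *a, u32 *b)
{
	if (!access_ok(p, 2 * sizeof(u32)))
		return -EFAULT;
	if (__get_user(*a, p) || __get_user(*b, p + 1))
		return -EFAULT;
	return 0;
}

/* Pattern B: get_user() re-validates (and can sanitize) the pointer
 * on each call, so no separate access_ok() is needed. */
static int read_pair_checked_twice(u32 __user *p, u32 *a, u32 *b)
{
	if (get_user(*a, p) || get_user(*b, p + 1))
		return -EFAULT;
	return 0;
}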

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07 14:37             ` Michael S. Tsirkin
@ 2019-01-08 10:01               ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2019-01-08 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, davem, Dan Williams


On 2019/1/7 10:37 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 02:50:17PM +0800, Jason Wang wrote:
>> On 2019/1/7 12:17 PM, Michael S. Tsirkin wrote:
>>> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
>>>> On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>>>>>> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Will review, thanks!
>>>>>>> One questions that comes to mind is whether it's all about bypassing
>>>>>>> stac/clac.  Could you please include a performance comparison with
>>>>>>> nosmap?
>>>>>>>
>>>>>> On machine without SMAP (Sandy Bridge):
>>>>>>
>>>>>> Before: 4.8Mpps
>>>>>>
>>>>>> After: 5.2Mpps
>>>>> OK so would you say it's really unsafe versus safe accesses?
>>>>> Or would you say it's just a better written code?
>>>> It's the effect of removing speculation barrier.
>>> You mean __uaccess_begin_nospec introduced by
>>> commit 304ec1b050310548db33063e567123fae8fd0301
>>> ?
>> Yes.
>>
>>
>>> So fundamentally we do access_ok checks when supplying
>>> the memory table to the kernel thread, and we should
>>> do the spec barrier there.
>>>
>>> Then we can just create and use a variant of uaccess macros that does
>>> not include the barrier?
>>
>> The unsafe ones?
> Fundamentally yes.
>
>
>>> Or, how about moving the barrier into access_ok?
>>> This way repeated accesses with a single access_ok get a bit faster.
>>> CC Dan Williams on this idea.
>>
>> The problem is, e.g. the vhost control path: during mem table validation, we
>> don't even want to access them there, so the spec barrier is not needed.
> Again spec barrier is not needed as such at all. It's defence in depth.
> And mem table init is slow path. So we can stick a barrier there and it
> won't be a problem for anyone.


Considering it's a generic helper, for defence in depth we should keep it 
close to the places where we do the real userspace memory access.


>
>>>
>>>>>> On machine with SMAP (Broadwell):
>>>>>>
>>>>>> Before: 5.0Mpps
>>>>>>
>>>>>> After: 6.1Mpps
>>>>>>
>>>>>> No smap: 7.5Mpps
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>> no smap being before or after?
>>>>>
>>>> Let me clarify:
>>>>
>>>>
>>>> Before (SMAP on): 5.0Mpps
>>>>
>>>> Before (SMAP off): 7.5Mpps
>>>>
>>>> After (SMAP on): 6.1Mpps
>>>>
>>>>
>>>> Thanks
>>> How about after + smap off?
>>
>> After (SMAP off): 8.0Mpps
>>
>>> And maybe we want a module option just for the vhost thread to keep smap
>>> off generally since almost all it does is copy stuff from userspace into
>>> kernel anyway. Because what the above numbers show is that we really
>>> really want a solution that isn't limited to just meta-data access,
>>> and I really do not see how any such solution can not also be
>>> used to make meta-data access fast.
>>
>> As we've discussed in another thread for the previous version, this requires
>> lots of changes; the main issue is that SMAP state is not saved/restored on an
>> explicit schedule().
> I wonder how expensive can reading eflags be?
> If it's cheap we can just check EFLAGS.AC and rerun stac if needed.


Probably not expensive, but considering vhost is probably the only user, is 
it really worth doing this? If we do vmap + batched copy, most of the code 
is still under the protection of SMAP but the performance is almost the 
same. Isn't that a much better solution?
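
For reference, a rough x86-only sketch of the EFLAGS.AC idea raised
above (whether the pushf is cheap enough is exactly the open question;
the helper name is made up):

#include <asm/irqflags.h>		/* native_save_fl() */
#include <asm/smap.h>			/* stac() */
#include <asm/processor-flags.h>	/* X86_EFLAGS_AC */

/*
 * Sketch: re-assert EFLAGS.AC after a schedule() if it was lost, so a
 * long-running "nosmap" section in the vhost thread could survive
 * being preempted.  Not something vhost does today.
 */
static inline void vhost_reassert_ac(void)
{
	if (!(native_save_fl() & X86_EFLAGS_AC))
		stac();
}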


>
>> Even if it did, since vhost will call lots of net/block code, any kind of
>> uaccess in that code needs to understand this special request from vhost,
>> e.g. you probably need to invent a new kind of iov iterator that does not
>> touch SMAP at all. And I'm not sure this is the only thing we need
>> to deal with.
>
> Well we wanted to move packet processing from tun into vhost anyway right?


Yes, but what about other devices? And we would also need to deal with the 
zerocopy path. It's not a small amount of refactoring and work.


>
>> So I still prefer to:
>>
>> 1) speedup the metadata access through vmap + MMU notifier
>>
>> 2) speedup the datacopy with batched copy (unsafe ones or other new
>> interfaces)
>>
>> Thanks
> I just guess once you do (2) you will want to rework (1) to use
> the new interfaces.



Do you mean batching? Batched copy is much easier: just a few lines of code 
once the unsafe variants are ready, or we can invent new safe variants. 
But it would still be slower than vmap. And what's more, vmap does not 
conflict with batching.
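
As a rough illustration of that batching direction with the unsafe
variants (sketch only; the helper is made up, and it leans on the
recently reworked user_access_begin(ptr, len)):

#include <linux/uaccess.h>

/*
 * Sketch: push a batch of 16-bit ring entries with a single
 * access_ok() + stac/clac pair instead of one per element.
 */
static int put_user_batch(u16 __user *dst, const u16 *src, unsigned int n)
{
	unsigned int i;

	if (!user_access_begin(dst, n * sizeof(*dst)))
		return -EFAULT;
	for (i = 0; i < n; i++)
		unsafe_put_user(src[i], &dst[i], efault);
	user_access_end();
	return 0;

efault:
	user_access_end();
	return -EFAULT;
}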


>   So all the effort you are now investing in (1)
> will be wasted. Just my $.02.
>

Speeding up metadata access is much easier, and vmap is the fastest method, 
so we can benefit from it soon. Speeding up the data copy requires much 
more work. And in the future, if the kernel or vhost is ready for some new 
API and the perf numbers prove its advantage, it does no harm to 
switch.


Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07 14:47     ` Michael S. Tsirkin
@ 2019-01-08 10:12       ` Jason Wang
  2019-01-11  8:59         ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2019-01-08 10:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/7 10:47 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 02:58:08PM +0800, Jason Wang wrote:
>> On 2019/1/5 5:41 AM, Michael S. Tsirkin wrote:
>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends since they had too much
>>>> overheads like checks, spec barriers or even hardware feature
>>>> toggling.
>>> I think it's a reasonable approach.
>>> However I need to look at whether and which mmu notifiers are invoked before
>>> writeback. Do you know?
>>
>> I don't know but just looking at the MMU notifier ops definition, there's no
>> such callback if my understanding is correct.
>>
>> Thanks
> In that case how are you making sure used ring updates are written back?
> If they aren't guest will crash ...


I think this is the writeback issue you mentioned earlier. I haven't 
followed up on the pointer, but it looks to me like some work is ongoing 
to fix the issue.

I can investigate it more, but it's not something new; consider the case 
of VFIO.

Thanks




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-07 14:11             ` Michael S. Tsirkin
  2019-01-07 21:39               ` Dan Williams
@ 2019-01-08 11:42               ` Jason Wang
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Wang @ 2019-01-08 11:42 UTC (permalink / raw)
  To: Michael S. Tsirkin, Dan Williams
  Cc: KVM list, virtualization, Netdev, Linux Kernel Mailing List,
	David Miller


On 2019/1/7 10:11 PM, Michael S. Tsirkin wrote:
> On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
>> On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
>>>> On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
>>>>>> On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>>>>> address instead of copy_user() friends since they had too much
>>>>>>>> overheads like checks, spec barriers or even hardware feature
>>>>>>>> toggling.
>>>>>>> Will review, thanks!
>>>>>>> One questions that comes to mind is whether it's all about bypassing
>>>>>>> stac/clac.  Could you please include a performance comparison with
>>>>>>> nosmap?
>>>>>>>
>>>>>> On machine without SMAP (Sandy Bridge):
>>>>>>
>>>>>> Before: 4.8Mpps
>>>>>>
>>>>>> After: 5.2Mpps
>>>>> OK so would you say it's really unsafe versus safe accesses?
>>>>> Or would you say it's just a better written code?
>>>>
>>>> It's the effect of removing speculation barrier.
>>>
>>> You mean __uaccess_begin_nospec introduced by
>>> commit 304ec1b050310548db33063e567123fae8fd0301
>>> ?
>>>
>>> So fundamentally we do access_ok checks when supplying
>>> the memory table to the kernel thread, and we should
>>> do the spec barrier there.
>>>
>>> Then we can just create and use a variant of uaccess macros that does
>>> not include the barrier?
>>>
>>> Or, how about moving the barrier into access_ok?
>>> This way repeated accesses with a single access_ok get a bit faster.
>>> CC Dan Williams on this idea.
>> It would be interesting to see how expensive re-doing the address
>> limit check is compared to the speculation barrier. I.e. just switch
>> vhost_get_user() to use get_user() rather than __get_user(). That will
>> sanitize the pointer in the speculative path without a barrier.
> Hmm it's way cheaper even though IIRC it's measureable.
> Jason, would you like to try?


0.5% regression after using get_user()/put_user()/...


> Although frankly __get_user being slower than get_user feels very wrong.
> Not yet sure what to do exactly but would you agree?
>
>
>> I recall we had a convert access_ok() discussion with this result here:
>>
>> https://lkml.org/lkml/2018/1/17/929
> Sorry let me try to clarify. IIUC speculating access_ok once
> is harmless. As Linus said the problem is with "_subsequent_
> accesses that can then be used to perturb the cache".
>
> Thus:
>
> 1. if (!access_ok)
> 2.	return
> 3.  get_user
> 4. if (!access_ok)
> 5.	return
> 6.  get_user
>
> Your proposal that Linus nacked was to effectively add a barrier after
> lines 2 and 5 (also using the array_index_nospec trick for speed),
> right? Unfortunately that needs a big API change.
>
> I am asking about adding barrier_nospec within access_ok.
> Thus effectively before lines 1 and 4.
> access_ok will be slower but after all the point of access_ok is
> to then access the same memory multiple times.
> So we should be making __get_user faster and access_ok slower ...


And the barrier_nospec() is completely necessary if you want to do a 
write instead of a read.


Thanks


>
>> ...but it sounds like you are proposing a smaller scope fixup for the
>> vhost use case? Something like barrier_nospec() in the success path
>> for all vhost access_ok() checks and then a get_user() variant that
>> disables the barrier.
> Maybe we'll have to. Except I hope vhost won't end up being the
> only user otherwise it will be hard to maintain.
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* __get_user slower than get_user (was Re: [RFC PATCH V3 0/5] Hi:)
  2019-01-07 22:44                   ` Dan Williams
@ 2019-01-09  4:31                     ` Michael S. Tsirkin
  2019-01-09  5:19                       ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-01-09  4:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller, Linus Torvalds

On Mon, Jan 07, 2019 at 02:44:24PM -0800, Dan Williams wrote:
> On Mon, Jan 7, 2019 at 2:25 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> > > On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > > > >
> > > > > > > > On 2019/1/7 11:28 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > > > > On 2019/1/3 4:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > > > This series tries to access virtqueue metadata through kernel virtual
> > > > > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > > > > toggling.
> > > > > > > > > > Will review, thanks!
> > > > > > > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > > > > nosmap?
> > > > > > > > > >
> > > > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > > > >
> > > > > > > > > Before: 4.8Mpps
> > > > > > > > >
> > > > > > > > > After: 5.2Mpps
> > > > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > > > Or would you say it's just a better written code?
> > > > > > >
> > > > > > >
> > > > > > > It's the effect of removing speculation barrier.
> > > > > >
> > > > > >
> > > > > > You mean __uaccess_begin_nospec introduced by
> > > > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > > > ?
> > > > > >
> > > > > > So fundamentally we do access_ok checks when supplying
> > > > > > the memory table to the kernel thread, and we should
> > > > > > do the spec barrier there.
> > > > > >
> > > > > > Then we can just create and use a variant of uaccess macros that does
> > > > > > not include the barrier?
> > > > > >
> > > > > > Or, how about moving the barrier into access_ok?
> > > > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > > > CC Dan Williams on this idea.
> > > > >
> > > > > It would be interesting to see how expensive re-doing the address
> > > > > limit check is compared to the speculation barrier. I.e. just switch
> > > > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > > > sanitize the pointer in the speculative path without a barrier.
> > > >
> > > > Hmm it's way cheaper even though IIRC it's measureable.
> > > > Jason, would you like to try?
> > > > Although frankly __get_user being slower than get_user feels very wrong.
> > > > Not yet sure what to do exactly but would you agree?
> > >
> > > Agree. __get_user() being faster than get_user() defeats the whole
> > > point of converting code paths to the access_ok() + __get_user()
> > > pattern.
> >
> > Did you mean the reverse?
> 
> Hmm, no... I'll rephrase: __get_user() should have lower overhead than
> get_user().

Right ...

Linus, given that you just changed all users of access_ok anyway, do
you still think that converting access_ok() to return a
speculation-sanitized pointer or NULL is too big a conversion?

It was previously discarded here:
https://lkml.org/lkml/2018/1/17/929
but at that point we didn't have numbers and there was an understandable
rush to ship something safe.

At this point I think that vhost can show very measurable gains from
this conversion.

Thanks,

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: __get_user slower than get_user (was Re: [RFC PATCH V3 0/5] Hi:)
  2019-01-09  4:31                     ` __get_user slower than get_user (was Re: [RFC PATCH V3 0/5] Hi:) Michael S. Tsirkin
@ 2019-01-09  5:19                       ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2019-01-09  5:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dan Williams, Jason Wang, KVM list, virtualization, Netdev,
	Linux Kernel Mailing List, David Miller

On Tue, Jan 8, 2019 at 8:31 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Linus, given that you just changed all users of access_ok anyway, do
> you still think that the access_ok() conversion to return a speculation
> sanitized pointer or NULL is too big a conversion?

I didn't actually change a single access_ok().

I changed the (very few) users of "user_access_begin()" to do an
access_ok() in them. There were 8 of them total.

It turns out that two of those cases (the strn*_user() ones) found
bugs in the implementation of access_ok() of two architectures, and
then looking at the others found that six more architectures also had
problems, but those weren't actually because of any access_ok()
changes, they were pre-existing issues. So we definitely had
unfortunate bugs in access_ok(), but they were mostly the benign kind
(i.e. the "use arguments twice" issue - a real potential bug, but not one
that is actually likely to make any difference to existing users).

Changing all 600+ users of access_ok() would be painful.

That said, one thing I *would* like to do is to just get rid of
__get_user() and __put_user() entirely. Or rather, just make them do
exactly the same thing that the normal "get_user()"/"put_user()"
functions do.

And then, _within_ the case of get_user()/put_user(), doing the
access_ok() as a data dependency rather than a lfence should be easy
enough.

                     Linus
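
A C-level sketch of the data-dependency idea described above (x86's
real get_user() does the clamping with an sbb-based mask in assembly;
array_index_nospec() and the helper name are used here only to
illustrate the concept):

#include <linux/uaccess.h>
#include <linux/nospec.h>

/*
 * Sketch: clamp the user pointer with a data dependency so that a
 * mispredicted access_ok() cannot speculatively dereference a kernel
 * address; the clamp forces the address to 0 on the bad path instead
 * of needing an lfence.
 */
static int get_user_clamped(u32 *val, u32 __user *uptr)
{
	unsigned long addr = (unsigned long)uptr;

	if (!access_ok(uptr, sizeof(*uptr)))
		return -EFAULT;
	addr = array_index_nospec(addr, TASK_SIZE_MAX);
	/* __get_user() stands in for a barrier-free raw access here */
	return __get_user(*val, (u32 __user *)addr);
}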

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH V3 0/5] Hi:
  2019-01-08 10:12       ` Jason Wang
@ 2019-01-11  8:59         ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2019-01-11  8:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization, netdev, linux-kernel, davem


On 2019/1/8 6:12 PM, Jason Wang wrote:
>
>> On 2019/1/7 10:47 PM, Michael S. Tsirkin wrote:
>> On Mon, Jan 07, 2019 at 02:58:08PM +0800, Jason Wang wrote:
>>> On 2019/1/5 5:41 AM, Michael S. Tsirkin wrote:
>>>> On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
>>>>> This series tries to access virtqueue metadata through kernel virtual
>>>>> address instead of copy_user() friends since they had too much
>>>>> overheads like checks, spec barriers or even hardware feature
>>>>> toggling.
>>>> I think it's a reasonable approach.
>>>> However I need to look at whether and which mmu notifiers are 
>>>> invoked before
>>>> writeback. Do you know?
>>>
>>> I don't know but just looking at the MMU notifier ops definition, 
>>> there's no
>>> such callback if my understanding is correct.
>>>
>>> Thanks
>> In that case how are you making sure used ring updates are written back?
>> If they aren't guest will crash ...
>
>
> I think this is the writeback issue you mentioned early. I don't do a 
> followup on the pointer but it looks to me some work is ongoing to fix 
> the issue.
>
> I can investigate it more, but it's not something new, consider the 
> case of VFIO.
>
> Thanks


Ok, after some investigation: the GUP + dirty pages issue is not easy to 
fix, so it may still take a while.

An idea is to switch back to the copy_user() friends if we find the 
metadata page is not anonymous.
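
A rough sketch of that check (the helper name is made up;
get_user_pages_fast() is used with its signature at the time of this
thread):

#include <linux/mm.h>
#include <linux/errno.h>

/*
 * Sketch: only keep the vmap fast path for anonymous pages; for
 * file-backed (or DAX) memory, report back so the caller sticks with
 * the copy_user() friends and writeback stays correct.
 */
static int vhost_pin_metadata_page(unsigned long uaddr, struct page **page)
{
	if (get_user_pages_fast(uaddr, 1, 1 /* write */, page) != 1)
		return -EFAULT;
	if (!PageAnon(*page)) {
		put_page(*page);
		return -EOPNOTSUPP;	/* fall back to copy_user() */
	}
	return 0;
}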

Does this sound good to you?

Thanks


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2019-01-11  8:59 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-29 12:46 [RFC PATCH V3 0/5] Hi: Jason Wang
2018-12-29 12:46 ` [RFC PATCH V3 1/5] vhost: generalize adding used elem Jason Wang
2019-01-04 21:29   ` Michael S. Tsirkin
2019-01-05  0:33     ` Sean Christopherson
2019-01-07  7:00       ` Jason Wang
2019-01-07 14:50         ` Michael S. Tsirkin
2018-12-29 12:46 ` [RFC PATCH V3 2/5] vhost: fine grain userspace memory accessors Jason Wang
2018-12-29 12:46 ` [RFC PATCH V3 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch() Jason Wang
2018-12-29 12:46 ` [RFC PATCH V3 4/5] vhost: introduce helpers to get the size of metadata area Jason Wang
2018-12-29 12:46 ` [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address Jason Wang
2019-01-04 21:34   ` Michael S. Tsirkin
2019-01-07  8:40     ` Jason Wang
2019-01-02 20:47 ` [RFC PATCH V3 0/5] Hi: Michael S. Tsirkin
2019-01-07  2:19   ` Jason Wang
2019-01-07  3:28     ` Michael S. Tsirkin
2019-01-07  3:53       ` Jason Wang
2019-01-07  4:17         ` Michael S. Tsirkin
2019-01-07  6:50           ` Jason Wang
2019-01-07 14:37             ` Michael S. Tsirkin
2019-01-08 10:01               ` Jason Wang
2019-01-07  7:15           ` Dan Williams
2019-01-07 14:11             ` Michael S. Tsirkin
2019-01-07 21:39               ` Dan Williams
2019-01-07 22:25                 ` Michael S. Tsirkin
2019-01-07 22:44                   ` Dan Williams
2019-01-09  4:31                     ` __get_user slower than get_user (was Re: [RFC PATCH V3 0/5] Hi:) Michael S. Tsirkin
2019-01-09  5:19                       ` Linus Torvalds
2019-01-08 11:42               ` [RFC PATCH V3 0/5] Hi: Jason Wang
2019-01-04 21:41 ` Michael S. Tsirkin
2019-01-07  6:58   ` Jason Wang
2019-01-07 14:47     ` Michael S. Tsirkin
2019-01-08 10:12       ` Jason Wang
2019-01-11  8:59         ` Jason Wang
