* [PATCHv2 00/14] virtio and vhost-net performance enhancements
@ 2011-05-19 23:10 ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

OK, here is the large patchset that implements the virtio spec update
that I sent earlier (the spec itself needs a minor update, will send
that out too next week, but I think we are on the same page here
already). It supersedes the PUBLISH_USED_IDX patches I sent
out earlier.

What follows is a patchset that actually includes 4 sets of
patches.  I note their status below.  Please consider for 2.6.40, at
least partially. Rusty, do you think it's feasible?

List of patches and what they do:

I) With the first patchset, we change the virtio ring notification
hand-off to work like the one in Xen:
each side publishes an event index, and the other side
notifies when it reaches that value,
with one difference: the event index starts at 0,
the same as the request index (in Xen, the event index starts at 1).
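
To make the hand-off concrete, here is a minimal standalone sketch of
the check each side performs (illustrative only; the canonical helper
is vring_need_event() in patch 2):

#include <stdint.h>

/* Sketch, not the patch code: with free-running 16-bit indexes,
 * notify iff moving the index from old_idx to new_idx crossed the
 * event index the other side published. */
static int crossed_event(uint16_t event_idx, uint16_t new_idx,
			 uint16_t old_idx)
{
	/* Unsigned 16-bit subtraction keeps this correct across
	 * index wrap-around. */
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}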

These are the patches in this set:
virtio: event index interface
virtio ring: inline function to check for events
virtio_ring: support event idx feature
vhost: support event index
virtio_test: support event index

Changes in this part of the patchset since v1: addressed comments by Rusty et al.

I tested this a lot with virtio net and block and with the simulator,
and especially with the simulator it's easy to see a drastic
performance improvement here:

[virtio]# time ./virtio_test 
spurious wakeus: 0x7

real    0m0.169s
user    0m0.140s
sys     0m0.019s
[virtio]# time ./virtio_test --no-event-idx
spurious wakeus: 0x11

real    0m0.649s
user    0m0.295s
sys     0m0.335s

And these patches are mostly unchanged from the very first version,
the changes being almost exclusively code cleanups.  So I consider this
part the most stable, and I strongly think these patches should go into
2.6.40.  One extra reason besides performance is that maintaining them
out of tree is very painful, as the guest/host ABI is affected.

II) Second set of patches: new APIs and their use in virtio_net.
With the indexes in place it becomes possible to request an event after
many requests (and not just on the next one, as done now). This should
fix the TX queue overrun which currently triggers a storm of
interrupts.
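
As a rough sketch of how a driver uses this (virtqueue_enable_cb_delayed()
and virtqueue_disable_cb() are the API names from the patches below;
free_old_xmit_skbs() is a hypothetical stand-in for the driver's own
completion reaper):

	/* At the end of the xmit path: ask for an interrupt only
	 * after a batch of buffers has been used, instead of on the
	 * very next one. */
	if (!virtqueue_enable_cb_delayed(vq)) {
		/* Buffers completed while we were re-enabling the
		 * callback: disable again and reap them now. */
		virtqueue_disable_cb(vq);
		free_old_xmit_skbs(vq);
	}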

Another issue I tried to fix is capacity checks in virtio-net;
there's a new API for that. On top of that,
I implemented a patch improving the real-time characteristics
of virtio_net.
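
The intended use of the capacity check is roughly this (the helper
name below is my shorthand and an assumption; see Shirley Ma's
"virtio_ring: Add capacity check API" patch for the real interface):

	/* Hypothetical helper name: stop the TX queue when the ring
	 * can no longer hold a worst-case skb (one descriptor per
	 * fragment, plus header and linear part). */
	if (virtqueue_min_capacity(vq) < MAX_SKB_FRAGS + 2)
		netif_stop_queue(dev);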

Thus we get the second patchset:
virtio: add api for delayed callbacks
virtio_net: delay TX callbacks
virtio_ring: Add capacity check API
virtio_net: fix TX capacity checks using new API
virtio_net: limit xmit polling

This has some fixes that I posted previously applied,
but is otherwise identical to v1. I tried to change the API
for enable_cb_delayed as Rusty suggested but failed to do this;
I don't think it can be defined cleanly.

These work fine for me; I think they can be merged for 2.6.40
too, but it would be nice to hear back from Shirley, Tom, and Krishna.

III) There's also a patch that adds a tweak to the virtio ring:
virtio: don't delay avail index update

This seems to help small message sizes where we are constantly draining
the RX VQ.
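
In ring terms the tweak amounts to publishing avail->idx as each
buffer is added, instead of batching the update until kick time; a
schematic contrast (guest side, illustrative only):

	/* Before: the index is published only at kick time. */
	vq->vring.avail->idx += vq->num_added;	/* in virtqueue_kick() */

	/* After (schematically): publish per added buffer, so a host
	 * that is busy draining the RX VQ sees new entries at once. */
	virtio_wmb();			/* descriptors visible first */
	vq->vring.avail->idx++;		/* then expose the new entry */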

I'll need to benchmark this to be able to give any numbers
with confidence, but I don't see how it can hurt anything.
Thoughts?

IV) The last part is a set of patches to extend feature bits
to 64 bit. I tested this by using feature bit 32.
vhost: fix 64 bit features
virtio_test: update for 64 bit features
virtio: 64 bit features

It's nice to have, as this set used up the last free bit.
But it's not a must, now that a single bit controls
the use of the event index on both sides.
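
Schematically, the widening just turns the features word into a u64
(illustrative only; bit numbers as used in this series):

	/* With a 32-bit features word, bit 32 is not expressible;
	 * a u64 makes it valid. */
	u64 features = (1ULL << VIRTIO_RING_F_EVENT_IDX)  /* bit 29 */
		     | (1ULL << 32);	/* first bit past a 32-bit word */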

The patchset is on top of net-next, which at the time
I last rebased was at 15ecd03 - so roughly 2.6.39-rc2.
For testing I usually merge v2.6.39 on top.

A qemu patch is also ready.  Code can be pulled from here:

git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next-event-idx-v3
git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git virtio-net-event-idx-v3

Rusty, I think it will be easier to merge the vhost and virtio bits in
one go. Can it all go in through your tree? (Dave has in the past acked
sending a very similar patch through you, so it should not be a problem.)



-- 
1.7.5.53.gc233e


Michael S. Tsirkin (13):
  virtio: event index interface
  virtio ring: inline function to check for events
  virtio_ring: support event idx feature
  vhost: support event index
  virtio_test: support event index
  virtio: add api for delayed callbacks
  virtio_net: delay TX callbacks
  virtio_net: fix TX capacity checks using new API
  virtio_net: limit xmit polling
  virtio: don't delay avail index update
  virtio: 64 bit features
  virtio_test: update for 64 bit features
  vhost: fix 64 bit features

Shirley Ma (1):
  virtio_ring: Add capacity check API

 drivers/lguest/lguest_device.c |    8 +-
 drivers/net/virtio_net.c       |   27 +++++---
 drivers/s390/kvm/kvm_virtio.c  |    8 +-
 drivers/vhost/net.c            |   12 ++--
 drivers/vhost/test.c           |    6 +-
 drivers/vhost/vhost.c          |  138 ++++++++++++++++++++++++++++++----------
 drivers/vhost/vhost.h          |   29 +++++---
 drivers/virtio/virtio.c        |    8 +-
 drivers/virtio/virtio_pci.c    |   34 ++++++++--
 drivers/virtio/virtio_ring.c   |   87 ++++++++++++++++++++++---
 include/linux/virtio.h         |   16 ++++-
 include/linux/virtio_config.h  |   15 +++--
 include/linux/virtio_pci.h     |    9 ++-
 include/linux/virtio_ring.h    |   29 ++++++++-
 tools/virtio/virtio_test.c     |   27 +++++++-
 15 files changed, 348 insertions(+), 105 deletions(-)

-- 
1.7.5.53.gc233e

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [PATCHv2 01/14] virtio: event index interface
@ 2011-05-19 23:10   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Define a new feature bit for the guest and host to utilize
an event index (like Xen) instead of a flag bit to enable/disable
interrupts and kicks.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_ring.h |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index e4d144b..70b0b39 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -29,6 +29,12 @@
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
+/* The Guest publishes the used index for which it expects an interrupt
+ * at the end of the avail ring. Host should ignore the avail->flags field. */
+/* The Host publishes the avail index for which it expects a kick
+ * at the end of the used ring. Guest should ignore the used->flags field. */
+#define VIRTIO_RING_F_EVENT_IDX		29
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
 	/* Address (guest-physical). */
@@ -83,6 +89,7 @@ struct vring {
  *	__u16 avail_flags;
  *	__u16 avail_idx;
  *	__u16 available[num];
+ *	__u16 used_event_idx;
  *
  *	// Padding to the next align boundary.
  *	char pad[];
@@ -91,8 +98,14 @@ struct vring {
  *	__u16 used_flags;
  *	__u16 used_idx;
  *	struct vring_used_elem used[num];
+ *	__u16 avail_event_idx;
  * };
  */
+/* We publish the used event index at the end of the available ring, and vice
+ * versa. They are at the end for backwards compatibility. */
+#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
+#define vring_avail_event(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
+
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 			      unsigned long align)
 {
@@ -107,7 +120,7 @@ static inline unsigned vring_size(unsigned int num, unsigned long align)
 {
 	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
 		 + align - 1) & ~(align - 1))
-		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
+		+ sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
 }
 
 #ifdef __KERNEL__
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 02/14] virtio ring: inline function to check for events
@ 2011-05-19 23:10   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

With the new used_event and avail_event features, both
host and guest need similar logic to check whether events are
enabled, so it helps to put the common code in the header.

Note that Xen has similar logic for notification hold-off
in include/xen/interface/io/ring.h with req_event and req_prod
corresponding to event_idx + 1 and new_idx respectively.
+1 comes from the fact that req_event and req_prod in Xen start at 1,
while event index in virtio starts at 0.
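
A worked instance of the check, for concreteness (values made up):

	/* The other side asked for an event at event_idx = 5; we just
	 * moved the index from old = 4 to new_idx = 7: */
	vring_need_event(5, 7, 4)
	/* = (__u16)(7 - 5 - 1) < (__u16)(7 - 4) = 1 < 3 -> notify.
	 * Had it asked for event_idx = 9 instead, (__u16)(7 - 9 - 1)
	 * = 0xfffd is not < 3, so no event would be sent yet. */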

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_ring.h |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 70b0b39..cf020e3 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -123,6 +123,20 @@ static inline unsigned vring_size(unsigned int num, unsigned long align)
 		+ sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
 }
 
+/* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
+/* Assuming a given event_idx value from the other side, if
+ * we have just incremented index from old to new_idx,
+ * should we trigger an event? */
+static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
+{
+	/* Note: Xen has similar logic for notification hold-off
+	 * in include/xen/interface/io/ring.h with req_event and req_prod
+	 * corresponding to event_idx + 1 and new_idx respectively.
+	 * Note also that req_event and req_prod in Xen start at 1,
+	 * event indexes in virtio start at 0. */
+	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
+}
+
 #ifdef __KERNEL__
 #include <linux/irqreturn.h>
 struct virtio_device;
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 03/14] virtio_ring: support event idx feature
@ 2011-05-19 23:10   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Support for the new event idx feature:
1. When enabling interrupts, publish the current avail index
   value to the host to get interrupts on the next update.
2. Use the new avail_event feature to reduce the number
   of exits from the guest.

Simple test with the simulator:

[virtio]# time ./virtio_test
spurious wakeus: 0x7

real    0m0.169s
user    0m0.140s
sys     0m0.019s
[virtio]# time ./virtio_test --no-event-idx
spurious wakeus: 0x11

real    0m0.649s
user    0m0.295s
sys     0m0.335s

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c |   26 ++++++++++++++++++++++++--
 1 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cc2f73e..1d0f9be 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -82,6 +82,9 @@ struct vring_virtqueue
 	/* Host supports indirect buffers */
 	bool indirect;
 
+	/* Host publishes avail event idx */
+	bool event;
+
 	/* Number of free buffers */
 	unsigned int num_free;
 	/* Head of free buffer list. */
@@ -237,18 +240,22 @@ EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 void virtqueue_kick(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 new, old;
 	START_USE(vq);
 	/* Descriptors and available array need to be set before we expose the
 	 * new available array entries. */
 	virtio_wmb();
 
-	vq->vring.avail->idx += vq->num_added;
+	old = vq->vring.avail->idx;
+	new = vq->vring.avail->idx = old + vq->num_added;
 	vq->num_added = 0;
 
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
-	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
+	if (vq->event ?
+	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
+	    !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
 		/* Prod other side to tell it about changes. */
 		vq->notify(&vq->vq);
 
@@ -324,6 +331,14 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	ret = vq->data[i];
 	detach_buf(vq, i);
 	vq->last_used_idx++;
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
+		vring_used_event(&vq->vring) = vq->last_used_idx;
+		virtio_mb();
+	}
+
 	END_USE(vq);
 	return ret;
 }
@@ -345,7 +360,11 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always do both to keep code simple. */
 	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
+	vring_used_event(&vq->vring) = vq->last_used_idx;
 	virtio_mb();
 	if (unlikely(more_used(vq))) {
 		END_USE(vq);
@@ -437,6 +456,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 #endif
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
@@ -471,6 +491,8 @@ void vring_transport_features(struct virtio_device *vdev)
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
+		case VIRTIO_RING_F_EVENT_IDX:
+			break;
 		default:
 			/* We don't understand this bit. */
 			clear_bit(i, vdev->features);
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 04/14] vhost: support event index
@ 2011-05-19 23:10   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Support the new event index feature. When the guest acks it,
utilize it to reduce the number of interrupts sent to the guest.
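
The core of the change is the host-side decision of when to signal;
schematically (a sketch of what vhost_notify() below computes, not
extra code):

	/* Remember the last used index we signalled at, and signal
	 * only if the guest's published used_event falls inside the
	 * window we have advanced over since then: */
	old = vq->signalled_used;
	new = vq->signalled_used = vq->last_used_idx;
	signal = vring_need_event(used_event, new, old);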

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/net.c   |   12 ++--
 drivers/vhost/test.c  |    6 +-
 drivers/vhost/vhost.c |  138 +++++++++++++++++++++++++++++++++++++------------
 drivers/vhost/vhost.h |   21 +++++---
 4 files changed, 127 insertions(+), 50 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..e224a92 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -144,7 +144,7 @@ static void handle_tx(struct vhost_net *net)
 	}
 
 	mutex_lock(&vq->mutex);
-	vhost_disable_notify(vq);
+	vhost_disable_notify(&net->dev, vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
@@ -166,8 +166,8 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
-			if (unlikely(vhost_enable_notify(vq))) {
-				vhost_disable_notify(vq);
+			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+				vhost_disable_notify(&net->dev, vq);
 				continue;
 			}
 			break;
@@ -315,7 +315,7 @@ static void handle_rx(struct vhost_net *net)
 		return;
 
 	mutex_lock(&vq->mutex);
-	vhost_disable_notify(vq);
+	vhost_disable_notify(&net->dev, vq);
 	vhost_hlen = vq->vhost_hlen;
 	sock_hlen = vq->sock_hlen;
 
@@ -334,10 +334,10 @@ static void handle_rx(struct vhost_net *net)
 			break;
 		/* OK, now we need to know about added descriptors. */
 		if (!headcount) {
-			if (unlikely(vhost_enable_notify(vq))) {
+			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
-				vhost_disable_notify(vq);
+				vhost_disable_notify(&net->dev, vq);
 				continue;
 			}
 			/* Nothing new?  Wait for eventfd to tell us
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 099f302..734e1d7 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -49,7 +49,7 @@ static void handle_vq(struct vhost_test *n)
 		return;
 
 	mutex_lock(&vq->mutex);
-	vhost_disable_notify(vq);
+	vhost_disable_notify(&n->dev, vq);
 
 	for (;;) {
 		head = vhost_get_vq_desc(&n->dev, vq, vq->iov,
@@ -61,8 +61,8 @@ static void handle_vq(struct vhost_test *n)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			if (unlikely(vhost_enable_notify(vq))) {
-				vhost_disable_notify(vq);
+			if (unlikely(vhost_enable_notify(&n->dev, vq))) {
+				vhost_disable_notify(&n->dev, vq);
 				continue;
 			}
 			break;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..2a10786 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,6 +37,9 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
+#define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
+#define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
+
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -161,6 +164,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->last_avail_idx = 0;
 	vq->avail_idx = 0;
 	vq->last_used_idx = 0;
+	vq->signalled_used = 0;
+	vq->signalled_used_valid = false;
 	vq->used_flags = 0;
 	vq->log_used = false;
 	vq->log_addr = -1ull;
@@ -489,16 +494,17 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 	return 1;
 }
 
-static int vq_access_ok(unsigned int num,
+static int vq_access_ok(struct vhost_dev *d, unsigned int num,
 			struct vring_desc __user *desc,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
 {
+	size_t s = vhost_has_feature(d, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
-			 sizeof *avail + num * sizeof *avail->ring) &&
+			 sizeof *avail + num * sizeof *avail->ring + s) &&
 	       access_ok(VERIFY_WRITE, used,
-			sizeof *used + num * sizeof *used->ring);
+			sizeof *used + num * sizeof *used->ring + s);
 }
 
 /* Can we log writes? */
@@ -514,9 +520,11 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 
 /* Verify access for write logging. */
 /* Caller should have vq mutex and device mutex */
-static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
+static int vq_log_access_ok(struct vhost_dev *d, struct vhost_virtqueue *vq,
+			    void __user *log_base)
 {
 	struct vhost_memory *mp;
+	size_t s = vhost_has_feature(d, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
 	mp = rcu_dereference_protected(vq->dev->memory,
 				       lockdep_is_held(&vq->mutex));
@@ -524,15 +532,15 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
-					vq->num * sizeof *vq->used->ring));
+					vq->num * sizeof *vq->used->ring + s));
 }
 
 /* Can we start vq? */
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-	return vq_access_ok(vq->num, vq->desc, vq->avail, vq->used) &&
-		vq_log_access_ok(vq, vq->log_base);
+	return vq_access_ok(vq->dev, vq->num, vq->desc, vq->avail, vq->used) &&
+		vq_log_access_ok(vq->dev, vq, vq->log_base);
 }
 
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
@@ -577,6 +585,7 @@ static int init_used(struct vhost_virtqueue *vq,
 
 	if (r)
 		return r;
+	vq->signalled_used_valid = false;
 	return get_user(vq->last_used_idx, &used->idx);
 }
 
@@ -674,7 +683,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		 * If it is not, we don't as size might not have been setup.
 		 * We will verify when backend is configured. */
 		if (vq->private_data) {
-			if (!vq_access_ok(vq->num,
+			if (!vq_access_ok(d, vq->num,
 				(void __user *)(unsigned long)a.desc_user_addr,
 				(void __user *)(unsigned long)a.avail_user_addr,
 				(void __user *)(unsigned long)a.used_user_addr)) {
@@ -818,7 +827,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 			vq = d->vqs + i;
 			mutex_lock(&vq->mutex);
 			/* If ring is inactive, will check when it's enabled. */
-			if (vq->private_data && !vq_log_access_ok(vq, base))
+			if (vq->private_data && !vq_log_access_ok(d, vq, base))
 				r = -EFAULT;
 			else
 				vq->log_base = base;
@@ -1219,6 +1228,10 @@ int vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 
 	/* On success, increment avail index. */
 	vq->last_avail_idx++;
+
+	/* Assume notifications from guest are disabled at this point,
+	 * if they aren't we would need to update avail_event index. */
+	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
 	return head;
 }
 
@@ -1267,6 +1280,12 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 			eventfd_signal(vq->log_ctx, 1);
 	}
 	vq->last_used_idx++;
+	/* If the driver never bothers to signal in a very long while,
+	 * used index might wrap around. If that happens, invalidate
+	 * signalled_used index we stored. TODO: make sure driver
+	 * signals at least once in 2^16 and remove this. */
+	if (unlikely(vq->last_used_idx == vq->signalled_used))
+		vq->signalled_used_valid = false;
 	return 0;
 }
 
@@ -1275,6 +1294,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			    unsigned count)
 {
 	struct vring_used_elem __user *used;
+	u16 old, new;
 	int start;
 
 	start = vq->last_used_idx % vq->num;
@@ -1292,7 +1312,14 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			   ((void __user *)used - (void __user *)vq->used),
 			  count * sizeof *used);
 	}
-	vq->last_used_idx += count;
+	old = vq->last_used_idx;
+	new = (vq->last_used_idx += count);
+	/* If the driver never bothers to signal in a very long while,
+	 * used index might wrap around. If that happens, invalidate
+	 * signalled_used index we stored. TODO: make sure driver
+	 * signals at least once in 2^16 and remove this. */
+	if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
+		vq->signalled_used_valid = false;
 	return 0;
 }
 
@@ -1331,29 +1358,47 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 	return r;
 }
 
-/* This actually signals the guest, using eventfd. */
-void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-	__u16 flags;
-
+	__u16 old, new, event;
+	bool v;
 	/* Flush out used index updates. This is paired
 	 * with the barrier that the Guest executes when enabling
 	 * interrupts. */
 	smp_mb();
 
-	if (__get_user(flags, &vq->avail->flags)) {
-		vq_err(vq, "Failed to get flags");
-		return;
+	if (vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
+	    unlikely(vq->avail_idx == vq->last_avail_idx))
+		return true;
+
+	if (!vhost_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+		__u16 flags;
+		if (__get_user(flags, &vq->avail->flags)) {
+			vq_err(vq, "Failed to get flags");
+			return true;
+		}
+		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
 	}
+	old = vq->signalled_used;
+	v = vq->signalled_used_valid;
+	new = vq->signalled_used = vq->last_used_idx;
+	vq->signalled_used_valid = true;
 
-	/* If they don't want an interrupt, don't signal, unless empty. */
-	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
-	    (vq->avail_idx != vq->last_avail_idx ||
-	     !vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY)))
-		return;
+	if (unlikely(!v))
+		return true;
 
+	if (get_user(event, vhost_used_event(vq))) {
+		vq_err(vq, "Failed to get used event idx");
+		return true;
+	}
+	return vring_need_event(event, new, old);
+}
+
+/* This actually signals the guest, using eventfd. */
+void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
 	/* Signal the Guest tell them we used something up. */
-	if (vq->call_ctx)
+	if (vq->call_ctx && vhost_notify(dev, vq))
 		eventfd_signal(vq->call_ctx, 1);
 }
 
@@ -1376,7 +1421,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 
 /* OK, now we need to know about added descriptors. */
-bool vhost_enable_notify(struct vhost_virtqueue *vq)
+bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	u16 avail_idx;
 	int r;
@@ -1384,11 +1429,34 @@ bool vhost_enable_notify(struct vhost_virtqueue *vq)
 	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
 		return false;
 	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
-	r = put_user(vq->used_flags, &vq->used->flags);
-	if (r) {
-		vq_err(vq, "Failed to enable notification at %p: %d\n",
-		       &vq->used->flags, r);
-		return false;
+	if (!vhost_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+		r = put_user(vq->used_flags, &vq->used->flags);
+		if (r) {
+			vq_err(vq, "Failed to enable notification at %p: %d\n",
+			       &vq->used->flags, r);
+			return false;
+		}
+	} else {
+		r = put_user(vq->avail_idx, vhost_avail_event(vq));
+		if (r) {
+			vq_err(vq, "Failed to update avail event index at %p: %d\n",
+			       vhost_avail_event(vq), r);
+			return false;
+		}
+	}
+	if (unlikely(vq->log_used)) {
+		void __user *used;
+		/* Make sure data is seen before log. */
+		smp_wmb();
+		used = vhost_has_feature(dev, VIRTIO_RING_F_EVENT_IDX) ?
+			&vq->used->flags : vhost_avail_event(vq);
+		/* Log used flags or event index entry write. Both are 16 bit
+		 * fields. */
+		log_write(vq->log_base, vq->log_addr +
+			   (used - (void __user *)vq->used),
+			  sizeof(u16));
+		if (vq->log_ctx)
+			eventfd_signal(vq->log_ctx, 1);
 	}
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
@@ -1404,15 +1472,17 @@ bool vhost_enable_notify(struct vhost_virtqueue *vq)
 }
 
 /* We don't need to be notified again. */
-void vhost_disable_notify(struct vhost_virtqueue *vq)
+void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	int r;
 
 	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
 		return;
 	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
-	r = put_user(vq->used_flags, &vq->used->flags);
-	if (r)
-		vq_err(vq, "Failed to enable notification at %p: %d\n",
-		       &vq->used->flags, r);
+	if (!vhost_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+		r = put_user(vq->used_flags, &vq->used->flags);
+		if (r)
+			vq_err(vq, "Failed to disable notification at %p: %d\n",
+			       &vq->used->flags, r);
+	}
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..8e03379 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -84,6 +84,12 @@ struct vhost_virtqueue {
 	/* Used flags */
 	u16 used_flags;
 
+	/* Last used index value we have signalled on */
+	u16 signalled_used;
+
+	/* Whether the signalled_used value above is valid */
+	bool signalled_used_valid;
+
 	/* Log writes to used structure. */
 	bool log_used;
 	u64 log_addr;
@@ -149,8 +155,8 @@ void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
 void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
 			       struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
-void vhost_disable_notify(struct vhost_virtqueue *);
-bool vhost_enable_notify(struct vhost_virtqueue *);
+void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
@@ -162,11 +168,12 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 	} while (0)
 
 enum {
-	VHOST_FEATURES = (1 << VIRTIO_F_NOTIFY_ON_EMPTY) |
-			 (1 << VIRTIO_RING_F_INDIRECT_DESC) |
-			 (1 << VHOST_F_LOG_ALL) |
-			 (1 << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1 << VIRTIO_NET_F_MRG_RXBUF),
+	VHOST_FEATURES = (1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+			 (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+			 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
+			 (1ULL << VHOST_F_LOG_ALL) |
+			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 05/14] virtio_test: support event index
@ 2011-05-19 23:11   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Add the ability to test the new event idx feature,
enabled by default.
---
 tools/virtio/virtio_test.c |   19 +++++++++++++++++--
 1 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index df0c6d2..74d3331 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -198,6 +198,14 @@ const struct option longopts[] = {
 		.val = 'h',
 	},
 	{
+		.name = "event-idx",
+		.val = 'E',
+	},
+	{
+		.name = "no-event-idx",
+		.val = 'e',
+	},
+	{
 		.name = "indirect",
 		.val = 'I',
 	},
@@ -211,13 +219,17 @@ const struct option longopts[] = {
 
 static void help()
 {
-	fprintf(stderr, "Usage: virtio_test [--help] [--no-indirect]\n");
+	fprintf(stderr, "Usage: virtio_test [--help]"
+		" [--no-indirect]"
+		" [--no-event-idx]"
+		"\n");
 }
 
 int main(int argc, char **argv)
 {
 	struct vdev_info dev;
-	unsigned long long features = 1ULL << VIRTIO_RING_F_INDIRECT_DESC;
+	unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+		(1ULL << VIRTIO_RING_F_EVENT_IDX);
 	int o;
 
 	for (;;) {
@@ -228,6 +240,9 @@ int main(int argc, char **argv)
 		case '?':
 			help();
 			exit(2);
+		case 'e':
+			features &= ~(1ULL << VIRTIO_RING_F_EVENT_IDX);
+			break;
 		case 'h':
 			help();
 			goto done;
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 06/14] virtio: add api for delayed callbacks
@ 2011-05-19 23:11   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Add an API that tells the other side that callbacks
should be delayed until a lot of work has been done.
Implement it using the new event_idx feature.

Note: it might seem advantageous to let the drivers
ask for a callback after a specific capacity has
been reached. However, as a single head can
free many entries in the descriptor table,
we don't really have a clue about capacity
until get_buf is called. The API is the simplest
to implement at the moment, we'll see what kind of
hints drivers can pass when there's more than one
user of the feature.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
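A minimal sketch of how a driver's TX-completion path might consume
this API (my_dev, free_buf and reclaim_tx are made-up names for the
example; only the virtqueue_* calls come from this series):

	#include <linux/virtio.h>

	/* Driver-side state and release routine, both invented
	 * for the sketch. */
	struct my_dev {
		struct virtqueue *svq;
	};
	static void free_buf(void *buf);

	static void reclaim_tx(struct my_dev *d)
	{
		unsigned int len;
		void *buf;

		for (;;) {
			/* Harvest everything that completed so far. */
			while ((buf = virtqueue_get_buf(d->svq, &len)) != NULL)
				free_buf(buf);

			/* Re-enable callbacks, hinting that we want one
			 * only after ~3/4 of the outstanding buffers
			 * have been used up. */
			if (virtqueue_enable_cb_delayed(d->svq))
				break;

			/* Many buffers completed while re-enabling:
			 * suppress callbacks and harvest again. */
			virtqueue_disable_cb(d->svq);
		}
	}
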
 drivers/virtio/virtio_ring.c |   27 +++++++++++++++++++++++++++
 include/linux/virtio.h       |    9 +++++++++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1d0f9be..6578e1a 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -376,6 +376,33 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 
+bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 bufs;
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always do both to keep code simple. */
+	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
+	/* TODO: tune this threshold */
+	bufs = (u16)(vq->vring.avail->idx - vq->last_used_idx) * 3 / 4;
+	vring_used_event(&vq->vring) = vq->last_used_idx + bufs;
+	virtio_mb();
+	if (unlikely((u16)(vq->vring.used->idx - vq->last_used_idx) > bufs)) {
+		END_USE(vq);
+		return false;
+	}
+
+	END_USE(vq);
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
+
 void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index aff5b4f..7108857 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -51,6 +51,13 @@ struct virtqueue {
  *	This re-enables callbacks; it returns "false" if there are pending
  *	buffers in the queue, to detect a possible race between the driver
  *	checking for more work, and enabling callbacks.
+ * virtqueue_enable_cb_delayed: restart callbacks after disable_cb.
+ *	vq: the struct virtqueue we're talking about.
+ *	This re-enables callbacks but hints to the other side to delay
+ *	interrupts until most of the available buffers have been processed;
+ *	it returns "false" if there are many pending buffers in the queue,
+ *	to detect a possible race between the driver checking for more work,
+ *	and enabling callbacks.
  * virtqueue_detach_unused_buf: detach first unused buffer
  * 	vq: the struct virtqueue we're talking about.
  * 	Returns NULL or the "data" token handed to add_buf
@@ -86,6 +93,8 @@ void virtqueue_disable_cb(struct virtqueue *vq);
 
 bool virtqueue_enable_cb(struct virtqueue *vq);
 
+bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
+
 void *virtqueue_detach_unused_buf(struct virtqueue *vq);
 
 /**
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 07/14] virtio_net: delay TX callbacks
@ 2011-05-19 23:11   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Ask for delayed callbacks on TX ring full, to give the
other side more of a chance to make progress.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
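To put a number on the effect: with the 3/4 heuristic from the
previous patch, a TX ring that fills up with 200 buffers outstanding
gets its used event index set to last_used + 150, so the guest is
interrupted once after 150 completions instead of once per completion
(200 is a made-up figure; the 3/4 factor is the TODO-marked threshold
in virtqueue_enable_cb_delayed).
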
 drivers/net/virtio_net.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0cb0b06..f685324 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -609,7 +609,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
 		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
+		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
 			/* More just got used, free them then recheck. */
 			capacity += free_old_xmit_skbs(vi);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 08/14] virtio_ring: Add capacity check API
@ 2011-05-19 23:11   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

From: Shirley Ma <mashirle@us.ibm.com>

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
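A minimal usage sketch (can_xmit is a made-up helper; only
virtqueue_get_capacity comes from this patch, and the
2 + MAX_SKB_FRAGS bound is the one virtio_net uses):

	#include <linux/skbuff.h>
	#include <linux/virtio.h>

	static bool can_xmit(struct virtqueue *vq)
	{
		/* A packet needs at most 2 + MAX_SKB_FRAGS descriptor
		 * slots (header + data), so stop queueing below that. */
		return virtqueue_get_capacity(vq) >= 2 + MAX_SKB_FRAGS;
	}
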
 drivers/virtio/virtio_ring.c |    8 ++++++++
 include/linux/virtio.h       |    5 +++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6578e1a..eed5f29 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -344,6 +344,14 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_buf);
 
+int virtqueue_get_capacity(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	return vq->num_free;
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_capacity);
+
 void virtqueue_disable_cb(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 7108857..58c0953 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -42,6 +42,9 @@ struct virtqueue {
  *	vq: the struct virtqueue we're talking about.
  *	len: the length written into the buffer
  *	Returns NULL or the "data" token handed to add_buf.
+ * virtqueue_get_capacity: get the current capacity of the queue
+ *	vq: the struct virtqueue we're talking about.
+ *	Returns remaining capacity of the queue.
  * virtqueue_disable_cb: disable callbacks
  *	vq: the struct virtqueue we're talking about.
  *	Note that this is not necessarily synchronous, hence unreliable and only
@@ -89,6 +92,8 @@ void virtqueue_kick(struct virtqueue *vq);
 
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
+int virtqueue_get_capacity(struct virtqueue *vq);
+
 void virtqueue_disable_cb(struct virtqueue *vq);
 
 bool virtqueue_enable_cb(struct virtqueue *vq);
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 09/14] virtio_net: fix TX capacity checks using new API
@ 2011-05-19 23:11   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

virtio net uses the number of sg entries to
track how much TX ring capacity was freed. But this
gives incorrect results when indirect buffers are
used: an indirect skb occupies a single ring
descriptor regardless of its sg count, so the sg
total overstates the capacity actually freed. Use
the new capacity API instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
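
(Illustration, not part of the patch: the worst-case test the driver now
performs with the new API. A TX skb needs at most 1 descriptor for the
virtio-net header, 1 for the linear data and MAX_SKB_FRAGS for the
fragments, hence the 2+MAX_SKB_FRAGS bound used in the driver:)

	if (virtqueue_get_capacity(vi->svq) < 2 + MAX_SKB_FRAGS)
		netif_stop_queue(dev);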
 drivers/net/virtio_net.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f685324..f33c92b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -509,19 +509,17 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static void free_old_xmit_skbs(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
-	unsigned int len, tot_sgs = 0;
+	unsigned int len;
 
 	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
 		dev_kfree_skb_any(skb);
 	}
-	return tot_sgs;
 }
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
@@ -611,7 +609,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		netif_stop_queue(dev);
 		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			free_old_xmit_skbs(vi);
+			capacity = virtqueue_get_capacity(vi->svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
 				netif_start_queue(dev);
 				virtqueue_disable_cb(vi->svq);
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-19 23:11   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

The current code can introduce a lot of latency variation:
if many completed buffers are pending at the time we attempt
to transmit a new one, they are all freed in a single batch.
This is bad for real-time applications and can't be good for
TCP either.

Instead, free up just enough buffers on each transmit to both
clean up all buffers eventually and make room for the next packet.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
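
(Illustration, not part of the patch: the for-loop condition in
free_old_xmit_skbs() below, unrolled for readability; stats updates
omitted. The loop keeps reclaiming while capacity is still short or
fewer than 2 buffers have been freed, and stops once nothing is left
to reclaim:)

	for (n = 0; ; ++n) {
		c = virtqueue_get_capacity(vi->svq) < capacity;
		if (!c && n >= 2)
			break;		/* enough room and >= 2 freed */
		skb = virtqueue_get_buf(vi->svq, &len);
		if (!skb)
			break;		/* nothing left to reclaim */
		dev_kfree_skb_any(skb);
	}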
 drivers/net/virtio_net.c |   22 ++++++++++++++--------
 1 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f33c92b..42935cb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -509,17 +509,25 @@ again:
 	return received;
 }
 
-static void free_old_xmit_skbs(struct virtnet_info *vi)
+static bool free_old_xmit_skbs(struct virtnet_info *vi, int capacity)
 {
 	struct sk_buff *skb;
 	unsigned int len;
-
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	bool c;
+	int n;
+
+	/* Free at least 2 skbs per skb sent: the backlog of completed
+	 * buffers then shrinks and all of the memory comes back. */
+	for (n = 0;
+	     ((c = virtqueue_get_capacity(vi->svq) < capacity) || n < 2) &&
+	     ((skb = virtqueue_get_buf(vi->svq, &len)));
+	     ++n) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
 		dev_kfree_skb_any(skb);
 	}
+	return !c;
 }
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
@@ -574,8 +582,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
 
-	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	/* Free enough pending old buffers to enable queueing new ones. */
+	free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS);
 
 	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);
@@ -609,9 +617,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		netif_stop_queue(dev);
 		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(vi);
-			capacity = virtqueue_get_capacity(vi->svq);
-			if (capacity >= 2+MAX_SKB_FRAGS) {
+			if (unlikely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
 				netif_start_queue(dev);
 				virtqueue_disable_cb(vi->svq);
 			}
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 11/14] virtio: don't delay avail index update
@ 2011-05-19 23:12   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Update the avail index immediately when a buffer is added,
instead of deferring it until the kick: for virtio-net RX
this lets the host start processing new buffers sooner,
improving parallelism with the host.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
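
(Illustration, not part of the patch: the producer-side ordering that
makes publishing the index on every add safe. The write barrier must
separate filling the ring entry from the index update, so the host can
never see an avail->idx that covers an entry whose contents are not
yet visible:)

	vq->vring.avail->ring[avail] = head;	/* 1. fill the new entry */
	virtio_wmb();				/* 2. order 1 before 3   */
	vq->vring.avail->idx++;			/* 3. expose it to host  */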
 drivers/virtio/virtio_ring.c |   28 +++++++++++++++++++---------
 1 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eed5f29..8218fe6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -89,7 +89,7 @@ struct vring_virtqueue
 	unsigned int num_free;
 	/* Head of free buffer list. */
 	unsigned int free_head;
-	/* Number we've added since last sync. */
+	/* Number we've added since last kick. */
 	unsigned int num_added;
 
 	/* Last used index we've seen. */
@@ -174,6 +174,13 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
 
 	BUG_ON(data == NULL);
 
+	/* Prevent drivers from adding more than num bufs without a kick. */
+	if (vq->num_added == vq->vring.num) {
+		printk(KERN_ERR "virtio_ring: too many buffers added with no kick\n");
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && (out + in) > 1 && vq->num_free) {
@@ -227,8 +234,14 @@ add_head:
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync).  FIXME: avoid modulus here? */
-	avail = (vq->vring.avail->idx + vq->num_added++) % vq->vring.num;
+	avail = vq->vring.avail->idx % vq->vring.num;
 	vq->vring.avail->ring[avail] = head;
+	vq->num_added++;
+
+	/* Descriptors and available array need to be set before we expose the
+	 * new available array entries. */
+	virtio_wmb();
+	vq->vring.avail->idx++;
 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
@@ -242,17 +255,14 @@ void virtqueue_kick(struct virtqueue *_vq)
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	u16 new, old;
 	START_USE(vq);
-	/* Descriptors and available array need to be set before we expose the
-	 * new available array entries. */
-	virtio_wmb();
-
-	old = vq->vring.avail->idx;
-	new = vq->vring.avail->idx = old + vq->num_added;
-	vq->num_added = 0;
 
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
+	new = vq->vring.avail->idx;
+	old = new - vq->num_added;
+	vq->num_added = 0;
+
 	if (vq->event ?
 	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
 	    !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 12/14] virtio: 64 bit features
  2011-05-19 23:10 ` Michael S. Tsirkin
@ 2011-05-19 23:12   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Extend features to 64 bit so we can use more
transport bits.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
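
(Illustration, not part of the patch: how a 64 bit feature mask maps
onto the two 32 bit PCI windows added below. The high word is only
valid once VIRTIO_F_FEATURES_HI has been seen in the low word:)

	static void split_features(u64 features, u32 *flo, u32 *fhi)
	{
		*flo = (u32)features;		/* bits 0..31  -> GUEST_FEATURES    */
		*fhi = (u32)(features >> 32);	/* bits 32..63 -> GUEST_FEATURES_HI */
	}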
 drivers/lguest/lguest_device.c |    8 ++++----
 drivers/s390/kvm/kvm_virtio.c  |    8 ++++----
 drivers/virtio/virtio.c        |    8 ++++----
 drivers/virtio/virtio_pci.c    |   34 ++++++++++++++++++++++++++++------
 drivers/virtio/virtio_ring.c   |    2 ++
 include/linux/virtio.h         |    2 +-
 include/linux/virtio_config.h  |   15 +++++++++------
 include/linux/virtio_pci.h     |    9 ++++++++-
 8 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 69c84a1..d2d6953 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -93,17 +93,17 @@ static unsigned desc_size(const struct lguest_device_desc *desc)
 }
 
 /* This gets the device's feature bits. */
-static u32 lg_get_features(struct virtio_device *vdev)
+static u64 lg_get_features(struct virtio_device *vdev)
 {
 	unsigned int i;
-	u32 features = 0;
+	u64 features = 0;
 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
 	u8 *in_features = lg_features(desc);
 
 	/* We do this the slow but generic way. */
-	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
+	for (i = 0; i < min(desc->feature_len * 8, 64); i++)
 		if (in_features[i / 8] & (1 << (i % 8)))
-			features |= (1 << i);
+			features |= (1ull << i);
 
 	return features;
 }
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 414427d..c56293c 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -79,16 +79,16 @@ static unsigned desc_size(const struct kvm_device_desc *desc)
 }
 
 /* This gets the device's feature bits. */
-static u32 kvm_get_features(struct virtio_device *vdev)
+static u64 kvm_get_features(struct virtio_device *vdev)
 {
 	unsigned int i;
-	u32 features = 0;
+	u64 features = 0;
 	struct kvm_device_desc *desc = to_kvmdev(vdev)->desc;
 	u8 *in_features = kvm_vq_features(desc);
 
-	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
+	for (i = 0; i < min(desc->feature_len * 8, 64); i++)
 		if (in_features[i / 8] & (1 << (i % 8)))
-			features |= (1 << i);
+			features |= (1ull << i);
 	return features;
 }
 
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index efb35aa..52b24d7 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -112,7 +112,7 @@ static int virtio_dev_probe(struct device *_d)
 	struct virtio_device *dev = container_of(_d,struct virtio_device,dev);
 	struct virtio_driver *drv = container_of(dev->dev.driver,
 						 struct virtio_driver, driver);
-	u32 device_features;
+	u64 device_features;
 
 	/* We have a driver! */
 	add_status(dev, VIRTIO_CONFIG_S_DRIVER);
@@ -124,14 +124,14 @@ static int virtio_dev_probe(struct device *_d)
 	memset(dev->features, 0, sizeof(dev->features));
 	for (i = 0; i < drv->feature_table_size; i++) {
 		unsigned int f = drv->feature_table[i];
-		BUG_ON(f >= 32);
-		if (device_features & (1 << f))
+		BUG_ON(f >= 64);
+		if (device_features & (1ull << f))
 			set_bit(f, dev->features);
 	}
 
 	/* Transport features always preserved to pass to finalize_features. */
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++)
-		if (device_features & (1 << i))
+		if (device_features & (1ull << i))
 			set_bit(i, dev->features);
 
 	dev->config->finalize_features(dev);
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 4fb5b2b..04b216f 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -44,6 +44,8 @@ struct virtio_pci_device
 	spinlock_t lock;
 	struct list_head virtqueues;
 
+	/* 64 bit features */
+	int features_hi;
 	/* MSI-X support */
 	int msix_enabled;
 	int intx_enabled;
@@ -103,26 +105,46 @@ static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
 }
 
 /* virtio config->get_features() implementation */
-static u32 vp_get_features(struct virtio_device *vdev)
+static u64 vp_get_features(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	u32 flo, fhi;
 
-	/* When someone needs more than 32 feature bits, we'll need to
+	/* When someone needs more than 32 feature bits, we need to
 	 * steal a bit to indicate that the rest are somewhere else. */
-	return ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES);
+	flo = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES);
+	if (flo & (0x1 << VIRTIO_F_FEATURES_HI)) {
+		vp_dev->features_hi = 1;
+		iowrite32(0x1 << VIRTIO_F_FEATURES_HI,
+			  vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES);
+		fhi = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES_HI);
+	} else {
+		vp_dev->features_hi = 0;
+		fhi = 0;
+	}
+	return (((u64)fhi) << 32) | flo;
 }
 
 /* virtio config->finalize_features() implementation */
 static void vp_finalize_features(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	u32 flo, fhi;
 
 	/* Give virtio_ring a chance to accept features. */
 	vring_transport_features(vdev);
 
-	/* We only support 32 feature bits. */
-	BUILD_BUG_ON(ARRAY_SIZE(vdev->features) != 1);
-	iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
+	/* We only support 64 feature bits. */
+	BUILD_BUG_ON(ARRAY_SIZE(vdev->features) != 64 / BITS_PER_LONG);
+	flo = vdev->features[0];
+	fhi = vdev->features[64 / BITS_PER_LONG - 1] >> (BITS_PER_LONG - 32);
+	iowrite32(flo, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES);
+	if (flo & (0x1 << VIRTIO_F_FEATURES_HI)) {
+		vp_dev->features_hi = 1;
+		iowrite32(fhi, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES_HI);
+	} else {
+		vp_dev->features_hi = 0;
+	}
 }
 
 /* virtio config->get() implementation */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 8218fe6..4a7a651 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -534,6 +534,8 @@ void vring_transport_features(struct virtio_device *vdev)
 
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
 		switch (i) {
+		case VIRTIO_F_FEATURES_HI:
+			break;
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
 		case VIRTIO_RING_F_EVENT_IDX:
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 58c0953..944ebcd 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -119,7 +119,7 @@ struct virtio_device {
 	struct virtio_config_ops *config;
 	struct list_head vqs;
 	/* Note that this is a Linux set_bit-style bitmap. */
-	unsigned long features[1];
+	unsigned long features[64 / BITS_PER_LONG];
 	void *priv;
 };
 
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 800617b..b1a1981 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -18,16 +18,19 @@
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED		0x80
 
-/* Some virtio feature bits (currently bits 28 through 31) are reserved for the
+/* Some virtio feature bits (currently bits 28 through 39) are reserved for the
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		32
+#define VIRTIO_TRANSPORT_F_END		40
 
 /* Do we get callbacks when the ring is completely used, even if we've
  * suppressed them? */
 #define VIRTIO_F_NOTIFY_ON_EMPTY	24
 
+/* Enables feature bits 32 to 63 (only really required for virtio_pci). */
+#define VIRTIO_F_FEATURES_HI		31
+
 #ifdef __KERNEL__
 #include <linux/err.h>
 #include <linux/virtio.h>
@@ -72,7 +75,7 @@
  * @del_vqs: free virtqueues found by find_vqs().
  * @get_features: get the array of feature bits for this device.
  *	vdev: the virtio_device
- *	Returns the first 32 feature bits (all we currently need).
+ *	Returns the first 64 feature bits (all we currently need).
  * @finalize_features: confirm what device features we'll be using.
  *	vdev: the virtio_device
  *	This gives the final feature bits for the device: it can change
@@ -92,7 +95,7 @@ struct virtio_config_ops {
 			vq_callback_t *callbacks[],
 			const char *names[]);
 	void (*del_vqs)(struct virtio_device *);
-	u32 (*get_features)(struct virtio_device *vdev);
+	u64 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 };
 
@@ -110,9 +113,9 @@ static inline bool virtio_has_feature(const struct virtio_device *vdev,
 {
 	/* Did you forget to fix assumptions on max features? */
 	if (__builtin_constant_p(fbit))
-		BUILD_BUG_ON(fbit >= 32);
+		BUILD_BUG_ON(fbit >= 64);
 	else
-		BUG_ON(fbit >= 32);
+		BUG_ON(fbit >= 64);
 
 	if (fbit < VIRTIO_TRANSPORT_F_START)
 		virtio_check_driver_offered_feature(vdev, fbit);
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index 9a3d7c4..90f9725 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -55,9 +55,16 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+/* An extended 32-bit r/o bitmask of the features supported by the host */
+#define VIRTIO_PCI_HOST_FEATURES_HI	24
+
+/* An extended 32-bit r/w bitmask of features activated by the guest */
+#define VIRTIO_PCI_GUEST_FEATURES_HI	28
+
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
-#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#define VIRTIO_PCI_CONFIG(dev)		((dev)->features_hi ? 32 : \
+						(dev)->msix_enabled ? 24 : 20)
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION		0
-- 
1.7.5.53.gc233e


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCHv2 13/14] virtio_test: update for 64 bit features
@ 2011-05-19 23:12   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Extend the virtio_test tool so it can work with
64 bit features.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
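
(Note, illustration only: the sizeof guards added below are needed
because vdev.features is a set_bit-style bitmap of unsigned long. On a
32 bit host the array has two words and the high half must be folded
in from features[1]; on a 64 bit host features[0] already holds all
64 bits, the guard is compile-time false, and features[1] is never
touched:)

	unsigned long long f = dev->vdev.features[0];
	if (sizeof f > sizeof dev->vdev.features[0])	/* 32 bit host */
		f |= (unsigned long long)dev->vdev.features[1] << 32;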
 tools/virtio/virtio_test.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index 74d3331..96cf9bf 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -55,7 +55,6 @@ void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
 {
 	struct vhost_vring_state state = { .index = info->idx };
 	struct vhost_vring_file file = { .index = info->idx };
-	unsigned long long features = dev->vdev.features[0];
 	struct vhost_vring_addr addr = {
 		.index = info->idx,
 		.desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
@@ -63,6 +62,10 @@ void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
 		.used_user_addr = (uint64_t)(unsigned long)info->vring.used,
 	};
 	int r;
+	unsigned long long features = dev->vdev.features[0];
+	if (sizeof features > sizeof dev->vdev.features[0])
+		features |= ((unsigned long long)dev->vdev.features[1]) << 32;
+
 	r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
 	assert(r >= 0);
 	state.num = info->vring.num;
@@ -107,7 +110,8 @@ static void vdev_info_init(struct vdev_info* dev, unsigned long long features)
 	int r;
 	memset(dev, 0, sizeof *dev);
 	dev->vdev.features[0] = features;
-	dev->vdev.features[1] = features >> 32;
+	if (sizeof features > sizeof dev->vdev.features[0])
+		dev->vdev.features[1] = features >> 32;
 	dev->buf_size = 1024;
 	dev->buf = malloc(dev->buf_size);
 	assert(dev->buf);
-- 
1.7.5.53.gc233e
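
[Editor's note: a sketch of how the updated tool can exercise a bit from the
new high word, matching the cover letter's test of feature bit 32.  It
assumes virtio_test.c's own definitions, including the vdev_info_init()
patched above; VIRTIO_RING_F_EVENT_IDX is the real event-index feature,
while bit 32 here is an arbitrary test bit, not a defined feature.]

    int main(void)
    {
            struct vdev_info dev;
            /* Event index plus one arbitrary bit from the new high word. */
            unsigned long long features =
                    (1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << 32);

            vdev_info_init(&dev, features);
            /* ... proceed as virtio_test's existing main() does ... */
            return 0;
    }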


^ permalink raw reply related	[flat|nested] 133+ messages in thread


* [PATCHv2 14/14] vhost: fix 64 bit features
@ 2011-05-19 23:12   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

Update vhost_has_feature to make it work correctly for feature bits 32 and above.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8e03379..64889d2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -123,7 +123,7 @@ struct vhost_dev {
 	struct vhost_memory __rcu *memory;
 	struct mm_struct *mm;
 	struct mutex mutex;
-	unsigned acked_features;
+	u64 acked_features;
 	struct vhost_virtqueue *vqs;
 	int nvqs;
 	struct file *log_file;
@@ -176,14 +176,14 @@ enum {
 			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
 };
 
-static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
+static inline bool vhost_has_feature(struct vhost_dev *dev, int bit)
 {
-	unsigned acked_features;
+	u64 acked_features;
 
 	/* TODO: check that we are running from vhost_worker or dev mutex is
 	 * held? */
 	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
-	return acked_features & (1 << bit);
+	return acked_features & (1ull << bit);
 }
 
 #endif
-- 
1.7.5.53.gc233e
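
[Editor's note: the 1ull cast is the whole fix.  Shifting a plain int by 32
or more is undefined behaviour in C, so `1 << bit` cannot form a mask for
the new high feature bits.  A standalone userspace illustration:]

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            int bit = 40;                   /* pretend bit 40 was acked */
            uint64_t acked = UINT64_C(1) << bit;

            /* (1 << bit) would shift a 32-bit int by 40: undefined
             * behaviour; the 1ULL form below is well defined. */
            printf("bit %d set: %d\n", bit, !!(acked & (1ULL << bit)));
            return 0;
    }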

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 00/14] virtio and vhost-net performance enhancements
@ 2011-05-19 23:20 ` David Miller
  -1 siblings, 0 replies; 133+ messages in thread
From: David Miller @ 2011-05-19 23:20 UTC (permalink / raw)
  To: mst
  Cc: linux-kernel, rusty, cotte, borntraeger, linux390, schwidefsky,
	heiko.carstens, xma, lguest, virtualization, netdev, linux-s390,
	kvm, krkumar2, tahm, steved, habanero

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 20 May 2011 02:10:07 +0300

> Rusty, I think it will be easier to merge vhost and virtio bits in one
> go. Can it all go in through your tree (Dave in the past acked
> sending a very similar patch through you so should not be a problem)?

And in case you want an explicit ack for the net bits:

Acked-by: David S. Miller <davem@davemloft.net>

:-)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 00/14] virtio and vhost-net performance enhancements
@ 2011-05-20  7:51   ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-20  7:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:10:07 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> OK, here is the large patchset that implements the virtio spec update
> that I sent earlier (the spec itself needs a minor update, will send
> that out too next week, but I think we are on the same page here
> already). It supersedes the PUBLISH_USED_IDX patches I sent
> out earlier.
> 
> What will follow will be a patchset that actually includes 4 sets of
> patches.  I note below their status.  Please consider for 2.6.40, at
> least partially. Rusty, do you think it's feasible?

Erk.  I'm still unsure that we should be using ring capacity as the
thresholding mechanism, given that *descriptor* exhaustion is what we
actually face.

That said, I will review these thoroughly in 14 hours (Sat morning my
time).  Perhaps I can convince myself that it's not a problem, because
it *is* simpler...

> List of patches and what they do:
> 
> I) With the first patchset, we change virtio ring notification
> hand-off to work like the one in Xen -
> each side publishes an event index, the other one
> notifies when it reaches that value -
> With the one difference that event index starts at 0,
> same as request index (in xen event index starts at 1).
> 
> These are the patches in this set:
> virtio: event index interface
> virtio ring: inline function to check for events
> virtio_ring: support event idx feature
> vhost: support event index
> virtio_test: support event index
> 
> Changes in this part of the patchset from v1 - address comments by Rusty et al.
> 
> I tested this a lot with virtio net block and with the simulator and esp
> with the simulator it's easy to see drastic performance improvement
> here:
> 
> [virtio]# time ./virtio_test 
> spurious wakeus: 0x7
> 
> real    0m0.169s
> user    0m0.140s
> sys     0m0.019s
> [virtio]# time ./virtio_test --no-event-idx
> spurious wakeus: 0x11
> 
> real    0m0.649s
> user    0m0.295s
> sys     0m0.335s
> 
> And these patches are mostly unchanged from the very first version,
> changes being almost exclusively code cleanups.  So I consider this part
> the most stable, I strongly think these patches should go into 2.6.40.
> One extra reason besides performance is that maintaining
> them out of tree is very painful as guest/host ABI is affected.
> 
> II) Second set of patches: new apis and use in virtio_net
> With the indexes in place it becomes possible to request an event after
> many requests (and not just on the next one as done now). This shall fix
> the TX queue overrun which currently triggers a storm of interrupts.
> 
> Another issue I tried to fix is capacity checks in virtio-net,
> there's a new API for that, and on top of that,
> I implemented a patch improving real-time characteristics
> of virtio_net
> 
> Thus we get the second patchset:
> virtio: add api for delayed callbacks
> virtio_net: delay TX callbacks
> virtio_ring: Add capacity check API
> virtio_net: fix TX capacity checks using new API
> virtio_net: limit xmit polling
> 
> This has some fixes that I posted previously applied,
> but otherwise identical to v1. I tried to change API
> for enable_cb_delayed as Rusty suggested but failed to do this.
> I think it's not possible to define cleanly.
> 
> These work fine for me, I think they can be merged for 2.6.40
> too but would be nice to hear back from Shirley, Tom, Krishna.

See other mail.

> III) There's also a patch that adds a tweak to virtio ring
> virtio: don't delay avail index update
> 
> This seems to help small message sizes where we are constantly draining
> the RX VQ.

This is independent.  If someone shows some benchmark improvement I'm
definitely happy to put this in .40, if nothing else.

> I'll need to benchmark this to be able to give any numbers
> with confidence, but I don't see how it can hurt anything.
> Thoughts?
> 
> IV) Last part is a set of patches to extend feature bits
> to 64 bit. I tested this by using feature bit 32.
> vhost: fix 64 bit features
> virtio_test: update for 64 bit features
> virtio: 64 bit features

Sweetness, but .41 material at this stage.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 09/14] virtio_net: fix TX capacity checks using new API
@ 2011-05-21  2:13     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:13 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:11:47 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> virtio net uses the number of sg entries to
> check for TX ring capacity freed. But this
> gives incorrect results when indirect buffers
> are used. Use the new capacity API instead.

OK, but this explanation needs enhancement, such as noting the actual
results of that miscalculation.  Something like:

 virtio_net uses the number of sg entries in the skb it frees to
 calculate how many descriptors in the ring have just been made
 available.  But this value is an overestimate: with indirect buffers
 each skb only uses one descriptor entry, meaning we may wake the queue
 only to find we still can't transmit anything.

 Using the new virtqueue_get_capacity() call, we can exactly determine
 the remaining capacity, so we should use that instead.

But, here's the side effect:
>  			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> +			free_old_xmit_skbs(vi);
> +			capacity = virtqueue_get_capacity(vi->svq);
>  			if (capacity >= 2+MAX_SKB_FRAGS) {

That capacity >= 2+MAX_SKB_FRAGS is too much for indirect buffers.  This
means we waste 20 entries in the ring, but OTOH if we hit OOM we fall
back to direct buffers and we *will* need this.

Which means this comment in the driver is now wrong:

	/* This can happen with OOM and indirect buffers. */
	if (unlikely(capacity < 0)) {
		if (net_ratelimit()) {
			if (likely(capacity == -ENOMEM)) {
				dev_warn(&dev->dev,
					 "TX queue failure: out of memory\n");
			} else {
				dev->stats.tx_fifo_errors++;
				dev_warn(&dev->dev,
					 "Unexpected TX queue failure: %d\n",
					 capacity);
			}
		}
		dev->stats.tx_dropped++;
		kfree_skb(skb);
		return NETDEV_TX_OK;
	}
	virtqueue_kick(vi->svq);

So I'm not applying this patch (nor the virtqueue_get_capacity
predecessor) for the moment.

Thanks,
Rusty.
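
[Editor's note: the overestimate Rusty describes, reduced to a hypothetical
helper (not driver code).  With indirect descriptors every skb occupies a
single ring slot, so crediting one slot per sg entry, as the pre-patch
accounting effectively did, can wake the queue when almost nothing was
actually reclaimed.]

    /* Hypothetical illustration only. */
    static unsigned int slots_reclaimed(unsigned int skbs_freed,
                                        unsigned int sg_per_skb, bool indirect)
    {
            /* The old accounting assumed the direct-buffer case always. */
            return indirect ? skbs_freed : skbs_freed * sg_per_skb;
    }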

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-21  2:19     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:19 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:11:56 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Current code might introduce a lot of latency variation
> if there are many pending bufs at the time we
> attempt to transmit a new one. This is bad for
> real-time applications and can't be good for TCP either.

Do we have more than speculation to back that up, BTW?

This patch is pretty sloppy; the previous ones were better polished.

> -static void free_old_xmit_skbs(struct virtnet_info *vi)
> +static bool free_old_xmit_skbs(struct virtnet_info *vi, int capacity)
>  {

A comment here indicating it returns true if it frees something?

>  	struct sk_buff *skb;
>  	unsigned int len;
> -
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	bool c;
> +	int n;
> +
> +	/* We try to free up at least 2 skbs per one sent, so that we'll get
> +	 * all of the memory back if they are used fast enough. */
> +	for (n = 0;
> +	     ((c = virtqueue_get_capacity(vi->svq) < capacity) || n < 2) &&
> +	     ((skb = virtqueue_get_buf(vi->svq, &len)));
> +	     ++n) {
>  		pr_debug("Sent skb %p\n", skb);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
>  		dev_kfree_skb_any(skb);
>  	}
> +	return !c;

This is for() abuse :)

Why is the capacity check in there at all?  Surely it's simpler to try
to free 2 skbs each time around?

   for (n = 0; n < 2; n++) {
           skb = virtqueue_get_buf(vi->svq, &len);
           if (!skb)
                   break;
           pr_debug("Sent skb %p\n", skb);
           vi->dev->stats.tx_bytes += skb->len;
           vi->dev->stats.tx_packets++;
           dev_kfree_skb_any(skb);
   }

>  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> @@ -574,8 +582,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
>  
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> +	/* Free enough pending old buffers to enable queueing new ones. */
> +	free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS);
>  
>  	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
> @@ -609,9 +617,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  		netif_stop_queue(dev);
>  		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
>  			/* More just got used, free them then recheck. */
> -			free_old_xmit_skbs(vi);
> -			capacity = virtqueue_get_capacity(vi->svq);
> -			if (capacity >= 2+MAX_SKB_FRAGS) {
> +			if (!likely(free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {

This extra argument to free_old_xmit_skbs seems odd, unless you have
future plans?

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 11/14] virtio: don't delay avail index update
@ 2011-05-21  2:26     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:26 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:12:19 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Update avail index immediately instead of upon kick:
> for virtio-net RX this helps parallelism with the host.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_ring.c |   28 +++++++++++++++++++---------
>  1 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index eed5f29..8218fe6 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -89,7 +89,7 @@ struct vring_virtqueue
>  	unsigned int num_free;
>  	/* Head of free buffer list. */
>  	unsigned int free_head;
> -	/* Number we've added since last sync. */
> +	/* Number we've added since last kick. */
>  	unsigned int num_added;

I always like to see obsolescent nomenclature cleaned up like this.
Thanks.

>  	/* Last used index we've seen. */
> @@ -174,6 +174,13 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
>  
>  	BUG_ON(data == NULL);
>  
> +	/* Prevent drivers from adding more than num bufs without a kick. */
> +	if (vq->num_added == vq->vring.num) {
> +		printk(KERN_ERR "gaaa!!!\n");
> +		END_USE(vq);
> +		return -ENOSPC;
> +	}
> +

I like "gaaa!" but it won't tell us which driver.  How about the more
conventional:

        if (WARN_ON(vq->num_added >= vq->vring.num)) {
        	END_USE(vq);
        	return -ENOSPC;
        }

I'd really like to see the results of this patch.  It's useless for
outgoing net traffic (we deal with one packet at a time) but perhaps a
flood of incoming packets would show something.

Thanks,
Rusty.
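
[Editor's note: the behavioural change under review, sketched against the
vring layout of this era.  Instead of bumping avail->idx once per kick, each
add_buf publishes its entry immediately; the barrier keeps the descriptor
writes ordered before the index update.  This is a sketch of the patch's
intent, not its exact code.]

            /* In virtqueue_add_buf_gfp(), after filling the descriptors: */
            vq->vring.avail->ring[avail] = head;
            wmb();  /* descriptors visible before the index update */
            vq->vring.avail->idx++;
            vq->num_added++;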

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 01/14] virtio: event index interface
@ 2011-05-21  2:29     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:10:17 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Define a new feature bit for the guest and host to utilize
> an event index (like Xen) instead of a flag bit to enable/disable
> interrupts and kicks.

Applied.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 02/14] virtio ring: inline function to check for events
@ 2011-05-21  2:29     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:10:27 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> With the new used_event and avail_event and features, both
> host and guest need similar logic to check whether events are
> enabled, so it helps to put the common code in the header.
> 
> Note that Xen has similar logic for notification hold-off
> in include/xen/interface/io/ring.h with req_event and req_prod
> corresponding to event_idx + 1 and new_idx respectively.
> +1 comes from the fact that req_event and req_prod in Xen start at 1,
> while event index in virtio starts at 0.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Applied.

Thanks,
Rusty.
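
[Editor's note: for readers following the thread, the shared logic being
applied boils down to one wrap-safe unsigned comparison.  A sketch of the
helper; see the patch itself for the authoritative version.]

    /* Would the other side, having published event_idx, want a
     * notification now that we have advanced from old to new_idx? */
    static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
    {
            /* True iff event_idx lies in the half-open window (old, new_idx],
             * computed modulo 2^16 so index wraparound is handled. */
            return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
    }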

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 03/14] virtio_ring: support event idx feature
@ 2011-05-21  2:31     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:31 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:10:44 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Support for the new event idx feature:
> 1. When enabling interrupts, publish the current avail index
>    value to the host to get interrupts on the next update.
> 2. Use the new avail_event feature to reduce the number
>    of exits from the guest.

Applied.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread
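
Point 1 above - publishing the avail index when enabling interrupts -
is the heart of the guest-side change; a rough sketch (helper names
assumed, barrier shown schematically):

static bool vring_enable_cb_sketch(struct vring_virtqueue *vq)
{
        /* Ask the host for an interrupt once it consumes anything
         * past the last used entry we have already seen. */
        vring_used_event(&vq->vring) = vq->last_used_idx;
        smp_mb();               /* publish the index before re-checking */
        return !more_used(vq);  /* false: work already pending, poll it */
}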

* Re: [PATCHv2 04/14] vhost: support event index
@ 2011-05-21  2:31     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:31 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:10:54 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Support the new event index feature. When acked,
> utilize it to reduce the # of interrupts sent to the guest.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Applied; even though it'd normally be in your tree, it's easier for me
to push it all together.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread
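
On the host side the test mirrors the guest one; a minimal sketch
(hypothetical helper - the real vhost code also handles the
compatibility path when the feature is not acked):

/* The used index moved from 'old' to 'new'; 'event' is the used_event
 * value the guest published in its avail ring. */
static bool host_should_signal(__u16 event, __u16 new, __u16 old)
{
        return (__u16)(new - event - 1) < (__u16)(new - old);
}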

* Re: [PATCHv2 05/14] virtio_test: support event index
@ 2011-05-21  2:32     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:32 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:11:05 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Add the ability to test the new event idx feature,
> enabled by default.

Applied.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 06/14] virtio: add api for delayed callbacks
@ 2011-05-21  2:33     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-21  2:33 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

On Fri, 20 May 2011 02:11:14 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Add an API that tells the other side that callbacks
> should be delayed until a lot of work has been done.
> Implement using the new event_idx feature.
> 
> Note: it might seem advantageous to let the drivers
> ask for a callback after a specific capacity has
> been reached. However, as a single head can
> free many entries in the descriptor table,
> we don't really have a clue about capacity
> until get_buf is called. The API is the simplest
> to implement at the moment; we'll see what kind of
> hints drivers can pass when there's more than one
> user of the feature.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Yes, I've applied this (and the next one which uses it in virtio_net),
despite my reservations about the API.  But that is fixable...

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread
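
The delayed variant differs from a plain callback enable only in which
index it publishes; a sketch, taking the 3/4 heuristic as one plausible
reading of "a lot of work":

static bool enable_cb_delayed_sketch(struct vring_virtqueue *vq)
{
        __u16 bufs;

        /* Rather than asking for a callback on the next used buffer,
         * ask for one only after ~3/4 of the outstanding buffers have
         * been consumed (memory barriers omitted for brevity). */
        bufs = (__u16)(vq->vring.avail->idx - vq->last_used_idx) * 3 / 4;
        vring_used_event(&vq->vring) = vq->last_used_idx + bufs;
        return !more_used(vq);
}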

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-22 12:10       ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-22 12:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Sat, May 21, 2011 at 11:49:59AM +0930, Rusty Russell wrote:
> On Fri, 20 May 2011 02:11:56 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Current code might introduce a lot of latency variation
> > if there are many pending bufs at the time we
> > attempt to transmit a new one. This is bad for
> > real-time applications and can't be good for TCP either.
> 
> Do we have more than speculation to back that up, BTW?

Need to dig this up: I thought we saw some reports of this on the list?

> This patch is pretty sloppy; the previous ones were better polished.
> 
> > -static void free_old_xmit_skbs(struct virtnet_info *vi)
> > +static bool free_old_xmit_skbs(struct virtnet_info *vi, int capacity)
> >  {
> 
> A comment here indicating it returns true if it frees something?

Agree.

> >  	struct sk_buff *skb;
> >  	unsigned int len;
> > -
> > -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > +	bool c;
> > +	int n;
> > +
> > +	/* We try to free up at least 2 skbs per one sent, so that we'll get
> > +	 * all of the memory back if they are used fast enough. */
> > +	for (n = 0;
> > +	     ((c = virtqueue_get_capacity(vi->svq) < capacity) || n < 2) &&
> > +	     ((skb = virtqueue_get_buf(vi->svq, &len)));
> > +	     ++n) {
> >  		pr_debug("Sent skb %p\n", skb);
> >  		vi->dev->stats.tx_bytes += skb->len;
> >  		vi->dev->stats.tx_packets++;
> >  		dev_kfree_skb_any(skb);
> >  	}
> > +	return !c;
> 
> This is for() abuse :)
> 
> Why is the capacity check in there at all?  Surely it's simpler to try
> to free 2 skbs each time around?

This is in case we can't use indirect: we want to free up
enough buffers for the following add_buf to succeed.


>    for (n = 0; n < 2; n++) {
>         skb = virtqueue_get_buf(vi->svq, &len);
>         if (!skb)
>                 break;
> 	pr_debug("Sent skb %p\n", skb);
> 	vi->dev->stats.tx_bytes += skb->len;
> 	vi->dev->stats.tx_packets++;
> 	dev_kfree_skb_any(skb);
>    }
> 
> >  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> > @@ -574,8 +582,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	struct virtnet_info *vi = netdev_priv(dev);
> >  	int capacity;
> >  
> > -	/* Free up any pending old buffers before queueing new ones. */
> > -	free_old_xmit_skbs(vi);
> > +	/* Free enough pending old buffers to enable queueing new ones. */
> > +	free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS);
> >  
> >  	/* Try to transmit */
> >  	capacity = xmit_skb(vi, skb);
> > @@ -609,9 +617,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		netif_stop_queue(dev);
> >  		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
> >  			/* More just got used, free them then recheck. */
> > -			free_old_xmit_skbs(vi);
> > -			capacity = virtqueue_get_capacity(vi->svq);
> > -			if (capacity >= 2+MAX_SKB_FRAGS) {
> > +			if (!likely(free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
> 
> This extra argument to free_old_xmit_skbs seems odd, unless you have
> future plans?
> 
> Thanks,
> Rusty.

I just wanted to localize the 2+MAX_SKB_FRAGS logic that tries to make
sure we have enough space in the buffer. Another way to do
that is with a define :).

-- 
MST

^ permalink raw reply	[flat|nested] 133+ messages in thread
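
For comparison, the loop under discussion can be untangled into a plain
while loop with no assignment inside the condition (intended to behave
like the posted patch; virtqueue_get_capacity() is the API added
earlier in this series):

static bool free_old_xmit_skbs(struct virtnet_info *vi, int capacity)
{
        struct sk_buff *skb;
        unsigned int len;
        int n = 0;

        /* Free at least 2 skbs per packet sent, and keep going while
         * the ring has fewer than 'capacity' free entries. */
        while (virtqueue_get_capacity(vi->svq) < capacity || n < 2) {
                skb = virtqueue_get_buf(vi->svq, &len);
                if (!skb)
                        break;
                pr_debug("Sent skb %p\n", skb);
                vi->dev->stats.tx_bytes += skb->len;
                vi->dev->stats.tx_packets++;
                dev_kfree_skb_any(skb);
                n++;
        }
        /* True if the ring now has the capacity the caller asked for. */
        return virtqueue_get_capacity(vi->svq) >= capacity;
}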

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-23  2:07         ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-23  2:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Sun, 22 May 2011 15:10:08 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Sat, May 21, 2011 at 11:49:59AM +0930, Rusty Russell wrote:
> > On Fri, 20 May 2011 02:11:56 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > Current code might introduce a lot of latency variation
> > > if there are many pending bufs at the time we
> > > attempt to transmit a new one. This is bad for
> > > real-time applications and can't be good for TCP either.
> > 
> > Do we have more than speculation to back that up, BTW?
> 
> Need to dig this up: I thought we saw some reports of this on the list?

I think so too, but a reference needs to be included here.

It helps to have exact benchmarks for what's being tested; otherwise we
risk unexpected interactions with the other optimization patches.

> > >  	struct sk_buff *skb;
> > >  	unsigned int len;
> > > -
> > > -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > > +	bool c;
> > > +	int n;
> > > +
> > > +	/* We try to free up at least 2 skbs per one sent, so that we'll get
> > > +	 * all of the memory back if they are used fast enough. */
> > > +	for (n = 0;
> > > +	     ((c = virtqueue_get_capacity(vi->svq) < capacity) || n < 2) &&
> > > +	     ((skb = virtqueue_get_buf(vi->svq, &len)));
> > > +	     ++n) {
> > >  		pr_debug("Sent skb %p\n", skb);
> > >  		vi->dev->stats.tx_bytes += skb->len;
> > >  		vi->dev->stats.tx_packets++;
> > >  		dev_kfree_skb_any(skb);
> > >  	}
> > > +	return !c;
> > 
> > This is for() abuse :)
> > 
> > Why is the capacity check in there at all?  Surely it's simpler to try
> > to free 2 skbs each time around?
> 
> This is in case we can't use indirect: we want to free up
> enough buffers for the following add_buf to succeed.

Sure, or we could just count the frags of the skb we're taking out,
which would be accurate for both cases and far more intuitive.

I.e., always try to free up twice as much as we're about to put in.

Can we hit problems with OOM?  Sure, but no worse than now...

The problem is that this "virtqueue_get_capacity()" returns the worst
case, not the normal case.  So using it is deceptive.

> I just wanted to localize the 2+MAX_SKB_FRAGS logic that tries to make
> sure we have enough space in the buffer. Another way to do
> that is with a define :).

To do this properly, we should really be using the actual number of sg
elements needed, but we'd have to do most of xmit_skb beforehand so we
know how many.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread
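
A sketch of the frag-counting alternative suggested above (hypothetical
helper, not code from this series): reclaim based on how many
descriptors the completed skbs actually used, e.g. aiming for twice
what the next packet could need.

static unsigned int free_old_xmit_descs(struct virtnet_info *vi,
                                        unsigned int want)
{
        struct sk_buff *skb;
        unsigned int len, freed = 0;

        while (freed < want && (skb = virtqueue_get_buf(vi->svq, &len))) {
                /* Direct layout: header descriptor plus linear part
                 * plus one per page fragment.  (With indirect, each
                 * completed skb hands back exactly one ring entry.) */
                freed += 2 + skb_shinfo(skb)->nr_frags;
                dev_kfree_skb_any(skb);
        }
        return freed;
}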

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-23 11:19           ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-23 11:19 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Mon, May 23, 2011 at 11:37:15AM +0930, Rusty Russell wrote:
> On Sun, 22 May 2011 15:10:08 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Sat, May 21, 2011 at 11:49:59AM +0930, Rusty Russell wrote:
> > > On Fri, 20 May 2011 02:11:56 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > Current code might introduce a lot of latency variation
> > > > if there are many pending bufs at the time we
> > > > attempt to transmit a new one. This is bad for
> > > > real-time applications and can't be good for TCP either.
> > > 
> > > Do we have more than speculation to back that up, BTW?
> > 
> > Need to dig this up: I thought we saw some reports of this on the list?
> 
> I think so too, but a reference needs to be included here.
> 
> It helps to have exact benchmarks for what's being tested; otherwise we
> risk unexpected interactions with the other optimization patches.
> 
> > > >  	struct sk_buff *skb;
> > > >  	unsigned int len;
> > > > -
> > > > -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > > > +	bool c;
> > > > +	int n;
> > > > +
> > > > +	/* We try to free up at least 2 skbs per one sent, so that we'll get
> > > > +	 * all of the memory back if they are used fast enough. */
> > > > +	for (n = 0;
> > > > +	     ((c = virtqueue_get_capacity(vi->svq) < capacity) || n < 2) &&
> > > > +	     ((skb = virtqueue_get_buf(vi->svq, &len)));
> > > > +	     ++n) {
> > > >  		pr_debug("Sent skb %p\n", skb);
> > > >  		vi->dev->stats.tx_bytes += skb->len;
> > > >  		vi->dev->stats.tx_packets++;
> > > >  		dev_kfree_skb_any(skb);
> > > >  	}
> > > > +	return !c;
> > > 
> > > This is for() abuse :)
> > > 
> > > Why is the capacity check in there at all?  Surely it's simpler to try
> > > to free 2 skbs each time around?
> > 
> > This is in case we can't use indirect: we want to free up
> > enough buffers for the following add_buf to succeed.
> 
> Sure, or we could just count the frags of the skb we're taking out,
> which would be accurate for both cases and far more intuitive.
> 
> I.e., always try to free up twice as much as we're about to put in.
> 
> Can we hit problems with OOM?  Sure, but no worse than now...
> The problem is that this "virtqueue_get_capacity()" returns the worst
> case, not the normal case.  So using it is deceptive.
> 

Maybe just document this?

I still believe capacity really needs to be decided
at the virtqueue level, not in the driver.
E.g. with indirect, each skb uses a single entry: freeing
one small skb is always enough to make space for a large one.

I do understand how it seems a waste to reserve direct-layout space
in the ring while in practice we might have space
due to indirect. I didn't come up with a nice way to
solve this yet - but it's 'no worse than now' :)

> > I just wanted to localize the 2+MAX_SKB_FRAGS logic that tries to make
> > sure we have enough space in the buffer. Another way to do
> > that is with a define :).
> 
> To do this properly, we should really be using the actual number of sg
> elements needed, but we'd have to do most of xmit_skb beforehand so we
> know how many.
> 
> Cheers,
> Rusty.

Maybe I'm confused here.  The problem isn't the failing
add_buf for the given skb IIUC.  What we are trying to do here is stop
the queue *before xmit_skb fails*. We can't look at the
number of fragments in the current skb - the next one can be
much larger.  That's why we check capacity after xmit_skb,
not before it, right?

-- 
MST

^ permalink raw reply	[flat|nested] 133+ messages in thread
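
One possible shape for "capacity decided at the virtqueue level"
(purely hypothetical API, sketched only to make the point concrete):
let the ring report capacity in units of addable buffers, hiding the
direct/indirect distinction from drivers.

static int virtqueue_capacity_bufs(struct vring_virtqueue *vq,
                                   int max_descs_per_buf)
{
        /* With indirect descriptors every add_buf consumes exactly one
         * ring entry, whatever the size of its scatterlist; without
         * them, assume the caller's worst case per buffer. */
        return vq->indirect ? vq->num_free
                            : vq->num_free / max_descs_per_buf;
}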

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-24  7:54             ` Krishna Kumar2
  0 siblings, 0 replies; 133+ messages in thread
From: Krishna Kumar2 @ 2011-05-24  7:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/23/2011 04:49:00 PM:

> > To do this properly, we should really be using the actual number of sg
> > elements needed, but we'd have to do most of xmit_skb beforehand so we
> > know how many.
> >
> > Cheers,
> > Rusty.
>
> Maybe I'm confused here.  The problem isn't the failing
> add_buf for the given skb IIUC.  What we are trying to do here is stop
> the queue *before xmit_skb fails*. We can't look at the
> number of fragments in the current skb - the next one can be
> much larger.  That's why we check capacity after xmit_skb,
> not before it, right?

Maybe Rusty means it is a simpler model to free the amount
of space that this xmit needs. We will still fail at some
point, but it is unlikely, since the earlier iteration
freed up at least the space that it was going to use. The
code could become much simpler:

start_xmit()
{
        num_sgs = get num_sgs for this skb;

        /* Free enough pending old buffers to enable queueing this one */
        free_old_xmit_skbs(vi, num_sgs * 2);     /* ?? */

        if (virtqueue_get_capacity() < num_sgs) {
                netif_stop_queue(dev);
                if (virtqueue_enable_cb_delayed(vi->svq) ||
                    free_old_xmit_skbs(vi, num_sgs)) {
                        /* Nothing freed up, or not enough freed up */
                        kfree_skb(skb);
                        return NETDEV_TX_OK;
                }
                netif_start_queue(dev);
                virtqueue_disable_cb(vi->svq);
        }

        /* xmit_skb cannot fail now, also pass 'num_sgs' */
        xmit_skb(vi, skb, num_sgs);
        virtqueue_kick(vi->svq);

        skb_orphan(skb);
        nf_reset(skb);

        return NETDEV_TX_OK;
}

We could even return TX_BUSY, since that makes the dequeue
code more efficient. See dev_dequeue_skb(): the check for an
already-stopped queue can skip a lot of code (and avoid taking
locks), but that fast path runs only if you returned TX_BUSY
on the earlier iteration.

BTW, shouldn't the check in start_xmit be:
	if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
		...
	}

Thanks,

- KK


^ permalink raw reply	[flat|nested] 133+ messages in thread
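
For concreteness, the TX_BUSY variant mentioned above would change the
stop path in the sketch to something like this (an assumption, not part
of the posted patches):

        if (virtqueue_get_capacity(vi->svq) < num_sgs) {
                netif_stop_queue(dev);
                /* The qdisc keeps the skb and retries once the queue
                 * is woken, instead of the driver dropping it. */
                return NETDEV_TX_BUSY;
        }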

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-24  9:12               ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-24  9:12 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

On Tue, May 24, 2011 at 01:24:15PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 05/23/2011 04:49:00 PM:
> 
> > > To do this properly, we should really be using the actual number of sg
> > > elements needed, but we'd have to do most of xmit_skb beforehand so we
> > > know how many.
> > >
> > > Cheers,
> > > Rusty.
> >
> > Maybe I'm confused here.  The problem isn't the failing
> > add_buf for the given skb IIUC.  What we are trying to do here is stop
> > the queue *before xmit_skb fails*. We can't look at the
> > number of fragments in the current skb - the next one can be
> > much larger.  That's why we check capacity after xmit_skb,
> > not before it, right?
> 
> Maybe Rusty means it is a simpler model to free the amount
> of space that this xmit needs. We will still fail at some
> point, but it is unlikely, since the earlier iteration
> freed up at least the space that it was going to use.

Not sure I understand.  We can't know that space was freed in the
previous iteration, as the buffers might not have been used by then.

> The
> code could become much simpler:
> 
> start_xmit()
> {
>         num_sgs = get num_sgs for this skb;
> 
>         /* Free enough pending old buffers to enable queueing this one */
>         free_old_xmit_skbs(vi, num_sgs * 2);     /* ?? */
> 
>         if (virtqueue_get_capacity() < num_sgs) {
>                 netif_stop_queue(dev);
>                 if (virtqueue_enable_cb_delayed(vi->svq) ||
>                     free_old_xmit_skbs(vi, num_sgs)) {
>                         /* Nothing freed up, or not enough freed up */
>                         kfree_skb(skb);
>                         return NETDEV_TX_OK;

This packet drop is what we wanted to avoid.


>                 }
>                 netif_start_queue(dev);
>                 virtqueue_disable_cb(vi->svq);
>         }
> 
>         /* xmit_skb cannot fail now, also pass 'num_sgs' */
>         xmit_skb(vi, skb, num_sgs);
>         virtqueue_kick(vi->svq);
> 
>         skb_orphan(skb);
>         nf_reset(skb);
> 
>         return NETDEV_TX_OK;
> }
> 
> We could even return TX_BUSY, since that makes the dequeue
> code more efficient. See dev_dequeue_skb() - you can skip a
> lot of code (and avoid taking locks) that checks whether the
> queue is already stopped, but that code runs only if you
> returned TX_BUSY in the earlier iteration.
> 
> BTW, shouldn't the check in start_xmit be:
> 	if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
> 		...
> 	}
> 
> Thanks,
> 
> - KK

I thought we used to do basically this but other devices moved to a
model where they stop *before* queueing fails, so we did too.
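
For reference, the stop-before-full pattern in question looks roughly
like this (a sketch only, reusing names from the code quoted above,
not the actual virtio_net code):

        /* add_buf has just succeeded for this skb */
        xmit_skb(vi, skb);
        virtqueue_kick(vi->svq);

        /*
         * Stop early: the next skb may need the worst-case
         * 2 + MAX_SKB_FRAGS descriptors, so never let add_buf
         * itself be the point of failure.
         */
        if (virtqueue_get_capacity() < 2 + MAX_SKB_FRAGS)
                netif_stop_queue(dev);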

-- 
MST

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
  2011-05-24  9:12               ` Michael S. Tsirkin
  (?)
  (?)
@ 2011-05-24  9:27               ` Krishna Kumar2
  2011-05-24 11:29                   ` Michael S. Tsirkin
  2011-05-24 11:29                 ` Michael S. Tsirkin
  -1 siblings, 2 replies; 133+ messages in thread
From: Krishna Kumar2 @ 2011-05-24  9:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/24/2011 02:42:55 PM:

> > > > To do this properly, we should really be using the actual number of sg
> > > > elements needed, but we'd have to do most of xmit_skb beforehand so we
> > > > know how many.
> > > >
> > > > Cheers,
> > > > Rusty.
> > >
> > > Maybe I'm confused here.  The problem isn't the failing
> > > add_buf for the given skb IIUC.  What we are trying to do here is stop
> > > the queue *before xmit_skb fails*. We can't look at the
> > > number of fragments in the current skb - the next one can be
> > > much larger.  That's why we check capacity after xmit_skb,
> > > not before it, right?
> >
> > Maybe Rusty means it is a simpler model to free the amount
> > of space that this xmit needs. We will still fail anyway
> > at some point, but it is unlikely, since the earlier iteration
> > freed up at least the space that it was going to use.
>
> Not sure I understand.  We can't know space is freed in the previous
> iteration as buffers might not have been used by then.

Yes, the first few iterations may not have freed up space, but
later ones should. The amount of free space should increase
from then on, especially since we try to free double of what
we consume.

> > The
> > code could become much simpler:
> >
> > start_xmit()
> > {
> >         num_sgs = get num_sgs for this skb;
> >
> >         /* Free enough pending old buffers to enable queueing this one */
> >         free_old_xmit_skbs(vi, num_sgs * 2);     /* ?? */
> >
> >         if (virtqueue_get_capacity() < num_sgs) {
> >                 netif_stop_queue(dev);
> >                 if (virtqueue_enable_cb_delayed(vi->svq) ||
> >                     free_old_xmit_skbs(vi, num_sgs)) {
> >                         /* Nothing freed up, or not enough freed up */
> >                         kfree_skb(skb);
> >                         return NETDEV_TX_OK;
>
> This packet drop is what we wanted to avoid.

Please see below on returning NETDEV_TX_BUSY.

>
> >                 }
> >                 netif_start_queue(dev);
> >                 virtqueue_disable_cb(vi->svq);
> >         }
> >
> >         /* xmit_skb cannot fail now, also pass 'num_sgs' */
> >         xmit_skb(vi, skb, num_sgs);
> >         virtqueue_kick(vi->svq);
> >
> >         skb_orphan(skb);
> >         nf_reset(skb);
> >
> >         return NETDEV_TX_OK;
> > }
> >
> > We could even return TX_BUSY, since that makes the dequeue
> > code more efficient. See dev_dequeue_skb() - you can skip a
> > lot of code (and avoid taking locks) that checks whether the
> > queue is already stopped, but that code runs only if you
> > returned TX_BUSY in the earlier iteration.
> >
> > BTW, shouldn't the check in start_xmit be:
> >    if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
> >       ...
> >    }
> >
> > Thanks,
> >
> > - KK
>
> I thought we used to do basically this but other devices moved to a
> model where they stop *before* queueing fails, so we did too.

I am not sure why it was changed, since returning TX_BUSY
seems more efficient IMHO. qdisc_restart() handles requeued
packets much better than a stopped queue, as a significant
part of that code is skipped if gso_skb is present (the qdisc
will eventually start dropping packets when tx_queue_len is
exceeded anyway).

Thanks,

- KK


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-24 11:29                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-24 11:29 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

On Tue, May 24, 2011 at 02:57:43PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 05/24/2011 02:42:55 PM:
> 
> > > > > To do this properly, we should really be using the actual number of sg
> > > > > elements needed, but we'd have to do most of xmit_skb beforehand so we
> > > > > know how many.
> > > > >
> > > > > Cheers,
> > > > > Rusty.
> > > >
> > > > Maybe I'm confused here.  The problem isn't the failing
> > > > add_buf for the given skb IIUC.  What we are trying to do here is stop
> > > > the queue *before xmit_skb fails*. We can't look at the
> > > > number of fragments in the current skb - the next one can be
> > > > much larger.  That's why we check capacity after xmit_skb,
> > > > not before it, right?
> > >
> > > Maybe Rusty means it is a simpler model to free the amount
> > > of space that this xmit needs. We will still fail anyway
> > > at some point, but it is unlikely, since the earlier iteration
> > > freed up at least the space that it was going to use.
> >
> > Not sure I understand.  We can't know space is freed in the previous
> > iteration as buffers might not have been used by then.
> 
> Yes, the first few iterations may not have freed up space, but
> later ones should. The amount of free space should increase
> from then on, especially since we try to free double of what
> we consume.

Hmm. This is only an upper limit on the # of entries in the queue.
Assume that vq size is 4 and we transmit 4 entries without
getting anything in the used ring. The next transmit will fail.

So I don't really see why it's unlikely that we reach the packet
drop code with your patch.
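
Spelling that worst case out against the sketch (vq size 4, four
entries already in flight, num_sgs == 2 for the next linear skb):

        free_old_xmit_skbs(vi, 4);      /* num_sgs * 2, but the used ring
                                           is empty: nothing is reclaimed */
        /* virtqueue_get_capacity() == 0, which is < num_sgs; the retry
         * free_old_xmit_skbs(vi, num_sgs) also reclaims nothing, so we
         * fall through to kfree_skb(skb) - the drop. */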

> > > The
> > > code could become much simpler:
> > >
> > > start_xmit()
> > > {
> > >         num_sgs = get num_sgs for this skb;
> > >
> > >         /* Free enough pending old buffers to enable queueing this one */
> > >         free_old_xmit_skbs(vi, num_sgs * 2);     /* ?? */
> > >
> > >         if (virtqueue_get_capacity() < num_sgs) {
> > >                 netif_stop_queue(dev);
> > >                 if (virtqueue_enable_cb_delayed(vi->svq) ||
> > >                     free_old_xmit_skbs(vi, num_sgs)) {
> > >                         /* Nothing freed up, or not enough freed up */
> > >                         kfree_skb(skb);
> > >                         return NETDEV_TX_OK;
> >
> > This packet drop is what we wanted to avoid.
> 
> Please see below on returning NETDEV_TX_BUSY.
> 
> >
> > >                 }
> > >                 netif_start_queue(dev);
> > >                 virtqueue_disable_cb(vi->svq);
> > >         }
> > >
> > >         /* xmit_skb cannot fail now, also pass 'num_sgs' */
> > >         xmit_skb(vi, skb, num_sgs);
> > >         virtqueue_kick(vi->svq);
> > >
> > >         skb_orphan(skb);
> > >         nf_reset(skb);
> > >
> > >         return NETDEV_TX_OK;
> > > }
> > >
> > > We could even return TX_BUSY, since that makes the dequeue
> > > code more efficient. See dev_dequeue_skb() - you can skip a
> > > lot of code (and avoid taking locks) that checks whether the
> > > queue is already stopped, but that code runs only if you
> > > returned TX_BUSY in the earlier iteration.
> > >
> > > BTW, shouldn't the check in start_xmit be:
> > >    if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
> > >       ...
> > >    }
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > I thought we used to do basically this but other devices moved to a
> > model where they stop *before* queueing fails, so we did too.
> 
> I am not sure why it was changed, since returning TX_BUSY
> seems more efficient IMHO.
> qdisc_restart() handles requeued
> packets much better than a stopped queue, as a significant
> part of that code is skipped if gso_skb is present

I think this is the argument:
http://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg06364.html


> (qdisc
> will eventually start dropping packets when tx_queue_len is
> exceeded anyway).
> 
> Thanks,
> 
> - KK

tx_queue_len is a pretty large buffer, so maybe not.
I think the packet drops from the scheduler queue can also be
done intelligently (e.g. with CHOKe), which should
work better than dropping a random packet?

-- 
MST

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-24 12:50                     ` Krishna Kumar2
  0 siblings, 0 replies; 133+ messages in thread
From: Krishna Kumar2 @ 2011-05-24 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/24/2011 04:59:39 PM:

> > > > Maybe Rusty means it is a simpler model to free the amount
> > > > of space that this xmit needs. We will still fail anyway
> > > > at some point, but it is unlikely, since the earlier iteration
> > > > freed up at least the space that it was going to use.
> > >
> > > Not sure I understand.  We can't know space is freed in the previous
> > > iteration as buffers might not have been used by then.
> >
> > Yes, the first few iterations may not have freed up space, but
> > later ones should. The amount of free space should increase
> > from then on, especially since we try to free double of what
> > we consume.
>
> Hmm. This is only an upper limit on the # of entries in the queue.
> Assume that vq size is 4 and we transmit 4 entries without
> getting anything in the used ring. The next transmit will fail.
>
> So I don't really see why it's unlikely that we reach the packet
> drop code with your patch.

I was assuming 256 entries :) I will try to get some
numbers to see how often it is true tomorrow.

> > I am not sure why it was changed, since returning TX_BUSY
> > seems more efficient IMHO.
> > qdisc_restart() handles requeued
> > packets much better than a stopped queue, as a significant
> > part of that code is skipped if gso_skb is present
>
> I think this is the argument:
> http://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg06364.html

Thanks for digging up that thread! Yes, that one skb would get
sent first ahead of possibly higher priority skbs. However,
from a performance point of view, the TX_BUSY code skips a lot
of checks and code for all subsequent packets till the device is
restarted. I can test performance with both cases and report
what I find (the requeue code has become very simple and clean
from "horribly complex", thanks to Herbert and Dave).

> > (qdisc
> > will eventually start dropping packets when tx_queue_len is
>
> tx_queue_len is a pretty large buffer, so maybe not.

I remember seeing tons of drops (pfifo_fast_enqueue) when
xmit returns TX_BUSY.

> I think the packet drops from the scheduler queue can also be
> done intelligently (e.g. with CHOKe), which should
> work better than dropping a random packet?

I am not sure of that - choke_enqueue checks against a random
skb to drop the current skb, and also during congestion. But for
my "sample driver xmit", returning TX_BUSY could still allow
it to be used with CHOKe.

thanks,

- KK


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-24 13:52                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-24 13:52 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

On Tue, May 24, 2011 at 06:20:35PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 05/24/2011 04:59:39 PM:
> 
> > > > > Maybe Rusty means it is a simpler model to free the amount
> > > > > of space that this xmit needs. We will still fail anyway
> > > > > at some point, but it is unlikely, since the earlier iteration
> > > > > freed up at least the space that it was going to use.
> > > >
> > > > Not sure I understand.  We can't know space is freed in the previous
> > > > iteration as buffers might not have been used by then.
> > >
> > > Yes, the first few iterations may not have freed up space, but
> > > later ones should. The amount of free space should increase
> > > from then on, especially since we try to free double of what
> > > we consume.
> >
> > Hmm. This is only an upper limit on the # of entries in the queue.
> > Assume that vq size is 4 and we transmit 4 entries without
> > getting anything in the used ring. The next transmit will fail.
> >
> > So I don't really see why it's unlikely that we reach the packet
> > drop code with your patch.
> 
> I was assuming 256 entries :) I will try to get some
> numbers to see how often it is true tomorrow.

That would depend on how fast the hypervisor is.
Try doing something to make the hypervisor slower than the guest.  I
don't think we need measurements to realize that with the host being
slower than the guest, that would happen a lot, though.

> > > I am not sure why it was changed, since returning TX_BUSY
> > > seems more efficient IMHO.
> > > qdisc_restart() handles requeued
> > > packets much better than a stopped queue, as a significant
> > > part of that code is skipped if gso_skb is present
> >
> > I think this is the argument:
> > http://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg06364.html
> 
> Thanks for digging up that thread! Yes, that one skb would get
> sent first ahead of possibly higher priority skbs. However,
> from a performance point of view, the TX_BUSY code skips a lot
> of checks and code for all subsequent packets till the device is
> restarted. I can test performance with both cases and report
> what I find (the requeue code has become very simple and clean
> from "horribly complex", thanks to Herbert and Dave).

Cc Herbert, and try to convince him :)

> > > (qdisc
> > > will eventually start dropping packets when tx_queue_len is
> >
> > tx_queue_len is a pretty large buffer, so maybe not.
> 
> I remember seeing tons of drops (pfifo_fast_enqueue) when
> xmit returns TX_BUSY.
> 
> > I think the packet drops from the scheduler queue can also be
> > done intelligently (e.g. with CHOKe), which should
> > work better than dropping a random packet?
> 
> I am not sure of that - choke_enqueue checks against a random
> skb to drop the current skb, and also during congestion. But for
> my "sample driver xmit", returning TX_BUSY could still allow
> it to be used with CHOKe.
> 
> thanks,
> 
> - KK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-25  1:28             ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-25  1:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Mon, 23 May 2011 14:19:00 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, May 23, 2011 at 11:37:15AM +0930, Rusty Russell wrote:
> > Can we hit problems with OOM?  Sure, but no worse than now...
> > The problem is that this "virtqueue_get_capacity()" returns the worst
> > case, not the normal case.  So using it is deceptive.
> > 
> 
> Maybe just document this?

Yes, but also by renaming virtqueue_get_capacity().  Takes it from a 3
to a 6 on the API hard-to-misuse scale.

How about, virtqueue_min_capacity()?  Makes the reader realize something
weird is going on.
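
Something like this, perhaps (a sketch of the renamed declaration and
the comment that should go with it):

/**
 * virtqueue_min_capacity - guaranteed free buffer slots
 * @vq: the virtqueue in question.
 *
 * Returns the worst-case number of buffers that can still be added:
 * it assumes no indirect descriptors are used, so the real capacity
 * may be higher when indirect buffers are in play.
 */
unsigned int virtqueue_min_capacity(struct virtqueue *vq);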

> I still believe capacity really needs to be decided
> at the virtqueue level, not in the driver.
> E.g. with indirect each skb uses a single entry: freeing
> 1 small skb is always enough to have space for a large one.
> 
> I do understand how it seems a waste to leave direct space
> in the ring while we might in practice have space
> due to indirect. Didn't come up with a nice way to
> solve this yet - but 'no worse than now :)'

Agreed.

> > > I just wanted to localize the 2+MAX_SKB_FRAGS logic that tries to make
> > > sure we have enough space in the buffer. Another way to do
> > > that is with a define :).
> > 
> > To do this properly, we should really be using the actual number of sg
> > elements needed, but we'd have to do most of xmit_skb beforehand so we
> > know how many.
> > 
> > Cheers,
> > Rusty.
> 
> Maybe I'm confused here.  The problem isn't the failing
> add_buf for the given skb IIUC.  What we are trying to do here is stop
> the queue *before xmit_skb fails*. We can't look at the
> number of fragments in the current skb - the next one can be
> much larger.  That's why we check capacity after xmit_skb,
> not before it, right?

No, I was confused...  More coffee!

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-25  1:35             ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-25  1:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Mon, 23 May 2011 14:19:00 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> I do understand how it seems a waste to leave direct space
> in the ring while we might in practice have space
> due to indirect. Didn't come up with a nice way to
> solve this yet - but 'no worse than now :)'

Let's just make it "bool free_old_xmit_skbs(unsigned int max)".  max ==
2 for the normal xmit path, so we're low latency but we keep ahead on
average.  max == -1 for the "we're out of capacity, we may have to stop
the queue".

That keeps it simple and probably the right thing...

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread
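
Spelled out, Rusty's suggestion could look roughly like this (a sketch with
hypothetical names: virtqueue_min_capacity() is the rename discussed above,
and the stats updates mirror virtio_net's usual style):

	/* Free up to @max completed skbs.  max == 2 on the normal xmit
	 * path; max == -1 wraps to UINT_MAX, i.e. "free everything you
	 * can" when we may have to stop the queue.  Returns true if a
	 * worst-case packet still fits in the ring. */
	static bool free_old_xmit_skbs(struct virtnet_info *vi, unsigned int max)
	{
		struct sk_buff *skb;
		unsigned int len;

		while (max-- && (skb = virtqueue_get_buf(vi->svq, &len))) {
			vi->dev->stats.tx_bytes += skb->len;
			vi->dev->stats.tx_packets++;
			dev_kfree_skb_any(skb);
		}
		return virtqueue_min_capacity(vi->svq) >= 2 + MAX_SKB_FRAGS;
	}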


* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-25  5:50               ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-25  5:50 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Wed, May 25, 2011 at 10:58:26AM +0930, Rusty Russell wrote:
> On Mon, 23 May 2011 14:19:00 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Mon, May 23, 2011 at 11:37:15AM +0930, Rusty Russell wrote:
> > > Can we hit problems with OOM?  Sure, but no worse than now...
> > > The problem is that this "virtqueue_get_capacity()" returns the worst
> > > case, not the normal case.  So using it is deceptive.
> > > 
> > 
> > Maybe just document this?
> 
> Yes, but also by renaming virtqueue_get_capacity().  Takes it from a 3
> to a 6 on the API hard-to-misuse scale.
> 
> How about, virtqueue_min_capacity()?  Makes the reader realize something
> weird is going on.

Absolutely. Great idea.

> > I still believe capacity really needs to be decided
> > at the virtqueue level, not in the driver.
> > E.g. with indirect each skb uses a single entry: freeing
> > 1 small skb is always enough to have space for a large one.
> > 
> > I do understand how it seems a waste to leave direct space
> > in the ring while we might in practice have space
> > due to indirect. Didn't come up with a nice way to
> > solve this yet - but 'no worse than now :)'
> 
> Agreed.
> 
> > > > I just wanted to localize the 2+MAX_SKB_FRAGS logic that tries to make
> > > > sure we have enough space in the buffer. Another way to do
> > > > that is with a define :).
> > > 
> > > To do this properly, we should really be using the actual number of sg
> > > elements needed, but we'd have to do most of xmit_skb beforehand so we
> > > know how many.
> > > 
> > > Cheers,
> > > Rusty.
> > 
> > Maybe I'm confused here.  The problem isn't the failing
> > add_buf for the given skb IIUC.  What we are trying to do here is stop
> > the queue *before xmit_skb fails*. We can't look at the
> > number of fragments in the current skb - the next one can be
> > much larger.  That's why we check capacity after xmit_skb,
> > not before it, right?
> 
> No, I was confused...  More coffee!
> 
> Thanks,
> Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread


* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-25  6:07               ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-25  6:07 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Wed, May 25, 2011 at 11:05:04AM +0930, Rusty Russell wrote:
> On Mon, 23 May 2011 14:19:00 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > I do understand how it seems a waste to leave direct space
> > in the ring while we might in practice have space
> > due to indirect. Didn't come up with a nice way to
> > solve this yet - but 'no worse than now :)'
> 
> Let's just make it "bool free_old_xmit_skbs(unsigned int max)".  max ==
> 2 for the normal xmit path, so we're low latency but we keep ahead on
> average.  max == -1 for the "we're out of capacity, we may have to stop
> the queue".
> 
> That keeps it simple and probably the right thing...
> 
> Thanks,
> Rusty.

Hmm I'm not sure I got it, need to think about this.
I'd like to go back and document how my design was supposed to work.
This really should have been in commit log or even a comment.
I thought we need a min, not a max.
We start with this:

	while ((c = (virtqueue_get_capacity(vq) < 2 + MAX_SKB_FRAGS)) &&
	       (skb = get_buf()))
		kfree_skb(skb);
	return !c;

This is clean and simple, right? And it's exactly asking for what we need.

But this way we always keep a lot of memory in skbs even when rate of
communication is low.

So we add the min parameter:

	int n = 0;

	while ((((c = (virtqueue_get_capacity(vq) < 2 + MAX_SKB_FRAGS)) ||
		 n++ < min) && (skb = get_buf())))
		kfree_skb(skb);
	return !c;


on the normal path min == 2 so we're low latency but we keep ahead on
average. min == 0 for the "we're out of capacity, we may have to stop
the queue".

Does the above make sense at all?

-- 
MST

^ permalink raw reply	[flat|nested] 133+ messages in thread
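
Folded into the same helper shape, MST's min-based variant would read
roughly as follows (a sketch; virtqueue_get_capacity() is the API added
earlier in this series, the remaining names are assumptions):

	/* Free completed skbs while the ring lacks worst-case room, but
	 * reclaim at least @min each call so we keep ahead on average.
	 * Returns true if a worst-case packet now fits. */
	static bool free_old_xmit_skbs(struct virtnet_info *vi, unsigned int min)
	{
		struct sk_buff *skb;
		unsigned int len, n = 0;
		bool full;

		while (((full = virtqueue_get_capacity(vi->svq) <
				 2 + MAX_SKB_FRAGS) || n++ < min) &&
		       (skb = virtqueue_get_buf(vi->svq, &len)))
			dev_kfree_skb_any(skb);
		return !full;
	}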


* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-26  3:28                 ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-26  3:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Wed, 25 May 2011 09:07:59 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, May 25, 2011 at 11:05:04AM +0930, Rusty Russell wrote:
> Hmm I'm not sure I got it, need to think about this.
> I'd like to go back and document how my design was supposed to work.
> This really should have been in commit log or even a comment.
> I thought we need a min, not a max.
> We start with this:
> 
> 	while ((c = (virtqueue_get_capacity(vq) < 2 + MAX_SKB_FRAGS)) &&
> 	       (skb = get_buf()))
> 		kfree_skb(skb);
> 	return !c;
> 
> This is clean and simple, right? And it's exactly asking for what we need.

No, I started from the other direction:

        for (i = 0; i < 2; i++) {
                skb = get_buf();
                if (!skb)
                        break;
                kfree_skb(skb);
        }

ie. free two packets for every one we're about to add.  For steady state
that would work really well.  Then we hit the case where the ring seems
full after we do the add: at that point, screw latency, and just try to
free all the buffers we can.

> on the normal path min == 2 so we're low latency but we keep ahead on
> average. min == 0 for the "we're out of capacity, we may have to stop
> the queue".
> 
> Does the above make sense at all?

It makes sense, but I think it's a classic case where incremental
improvements aren't as good as starting from scratch.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread
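
Putting the pieces together, the transmit path under Rusty's scheme would
take roughly this shape (a sketch against the free_old_xmit_skbs(max)
signature above; not the code that was eventually merged):

	static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct virtnet_info *vi = netdev_priv(dev);

		/* Steady state: free two completed packets per one added. */
		free_old_xmit_skbs(vi, 2);

		xmit_skb(vi, skb);
		virtqueue_kick(vi->svq);

		/* Ring looks full after the add: screw latency, reclaim
		 * all we can, and only stop the queue if that fails. */
		if (virtqueue_min_capacity(vi->svq) < 2 + MAX_SKB_FRAGS &&
		    !free_old_xmit_skbs(vi, -1))
			netif_stop_queue(dev);

		return NETDEV_TX_OK;
	}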


* [PERF RESULTS] virtio and vhost-net performance enhancements
@ 2011-05-26 15:32   ` Krishna Kumar2
  0 siblings, 0 replies; 133+ messages in thread
From: Krishna Kumar2 @ 2011-05-26 15:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390, netdev,
	Rusty Russell, Martin Schwidefsky, steved, Tom Lendacky,
	virtualization, Shirley Ma

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/20/2011 04:40:07 AM:

> OK, here is the large patchset that implements the virtio spec update
> that I sent earlier (the spec itself needs a minor update, will send
> that out too next week, but I think we are on the same page here
> already). It supercedes the PUBLISH_USED_IDX patches I sent
> out earlier.

I was able to get this tested by applying the v2 patches
to the git-next tree (somehow MST's git tree hung my guest,
which never got resolved). Testing was from Guest -> Remote
node, using an ixgbe 10g card. The test results are
*excellent* (table: #netperf sessions, BW% improvement,
SD% improvement, CPU% improvement):

___________________________________
           512 byte I/O
#     BW%     SD%      CPU%
____________________________________
1     151.6   -65.1    -10.7
2     180.6   -66.6    -6.4
4     15.5    -35.8    -26.1
8     1.8     -28.4    -26.7
16    3.1     -29.0    -26.5
32    1.1     -27.4    -27.5
64    3.8     -30.9    -26.7
96    5.4     -21.7    -24.2
128   5.7     -24.4    -25.5
____________________________________
BW: 16.6%   SD: -24.6%    CPU: -25.5%


____________________________________
            1K I/O
#     BW%     SD%      CPU%
____________________________________
1     233.9   -76.5    -18.0
2     112.2   -64.0    -23.2
4     9.2     -31.6    -26.1
8    -1.7     -26.8    -30.3
16    3.5     -31.5    -30.6
32    4.8     -25.2    -30.5
64    5.7     -31.0    -28.9
96    5.3     -32.2    -31.7
128   4.6     -38.2    -33.6
____________________________________
BW: 16.4%   SD: -35%    CPU: -31.5%


____________________________________
             16K I/O
#     BW%     SD%      CPU%
____________________________________
1     18.8    -27.2    -18.3
2     14.8    -36.7    -27.7
4     12.7    -45.2    -38.1
8     4.4     -56.4    -54.4
16    4.8     -38.3    -36.1
32    0        78.0     79.2
64    3.8     -38.1    -37.5
96    7.3     -35.2    -31.1
128   3.4     -31.1    -32.1
____________________________________
BW: 7.6%   SD: -30.1%   CPU: -23.7%


I plan to run some more tests tomorrow. Please let
me know if any other scenario will help.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 133+ messages in thread
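
For reference, numbers like these are typically collected with parallel
netperf sessions along these lines (illustrative invocations only; the
thread does not spell out the exact options used):

	# One of N parallel sessions; -m sets the send size (512B shown).
	netperf -H $REMOTE_HOST -t TCP_STREAM -l 60 -- -m 512

	# Request/response runs like the TCP_RR results further down:
	netperf -H $REMOTE_HOST -t TCP_RR -l 60 -- -r 512,512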


* Re: [PERF RESULTS] virtio and vhost-net performance enhancements
  2011-05-26 15:32   ` Krishna Kumar2
  (?)
@ 2011-05-26 15:42   ` Shirley Ma
  2011-05-26 16:21     ` Krishna Kumar2
                       ` (2 more replies)
  -1 siblings, 3 replies; 133+ messages in thread
From: Shirley Ma @ 2011-05-26 15:42 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: habanero, lguest, kvm, Carsten Otte, linux-s390,
	Michael S. Tsirkin, Heiko Carstens, linux-kernel, virtualization,
	Steve Dobbelstein, Christian Borntraeger, Tom Lendacky, netdev,
	Martin Schwidefsky, linux390



Hello KK,

	Could you please try TCP_RRs as well?

Thanks
Shirley


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PERF RESULTS] virtio and vhost-net performance enhancements
  2011-05-26 15:42   ` Shirley Ma
@ 2011-05-26 16:21     ` Krishna Kumar2
  2011-05-26 16:21     ` Krishna Kumar2
       [not found]     ` <OFF9D0E604.B865A006-ON6525789C.00597010-6525789C.0059987A@LocalDomain>
  2 siblings, 0 replies; 133+ messages in thread
From: Krishna Kumar2 @ 2011-05-26 16:21 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390,
	Michael S. Tsirkin, netdev, Rusty Russell, Martin Schwidefsky,
	Steve Dobbelstein, Tom Lendacky, virtualization

Shirley Ma <xma@us.ibm.com> wrote on 05/26/2011 09:12:22 PM:

> Could you please try TCP_RRs as well?

Right. Here's the result for TCP_RR:

__________________________________
#       RR%      SD%      CPU%
__________________________________
1       4.5     -31.4    -27.9
2       5.1      -9.7     -5.4
4      60.4     -13.4     38.8
8      67.8     -13.5     45.0
16     55.8      -8.0     43.2
32     66.9     -14.1     43.3
64     47.2     -23.7     12.2
96     29.7     -11.8     14.3
128     8.0       2.2     10.7
__________________________________
RR: 37.3%   SD: -6.7%   CPU: 15.7%
__________________________________

Thanks,

- KK


^ permalink raw reply	[flat|nested] 133+ messages in thread


* Re: [PERF RESULTS] virtio and vhost-net performance enhancements
       [not found]     ` <OFF9D0E604.B865A006-ON6525789C.00597010-6525789C.0059987A@LocalDomain>
@ 2011-05-26 16:29         ` Krishna Kumar2
  2011-05-26 16:29         ` Krishna Kumar2
  1 sibling, 0 replies; 133+ messages in thread
From: Krishna Kumar2 @ 2011-05-26 16:29 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Christian Borntraeger, Carsten Otte, habanero, Heiko Carstens,
	kvm, lguest, linux-kernel, linux-s390, linux390,
	Michael S. Tsirkin, netdev, Rusty Russell, Martin Schwidefsky,
	Steve Dobbelstein, Tom Lendacky, virtualization, Shirley Ma

Krishna Kumar2/India/IBM wrote on 05/26/2011 09:51:32 PM:

> > Could you please try TCP_RRs as well?
>
> Right. Here's the result for TCP_RR:

The actual transaction rate/second numbers are:

_____________________________________________________________
#     RR1      RR2 (%)          SD1        SD2 (%)
_____________________________________________________________
1     9476     9903 (4.5)       28.9       19.8 (-31.4)
2     17337    18225 (5.1)      92.7       83.7 (-9.7)
4     17385    27902 (60.4)     364.8      315.8 (-13.4)
8     25560    42912 (67.8)     1428.1     1234.0 (-13.5)
16    35898    55934 (55.8)     4391.6     4038.1 (-8.0)
32    48048    80228 (66.9)     17391.4    14932.0 (-14.1)
64    60412    88929 (47.2)     71087.7    54230.1 (-23.7)
96    71263    92439 (29.7)     145434.1   128214.0 (-11.8)
128   84208    91014 (8.0)      233668.2   238888.6 (2.2)
_____________________________________________________________
RR: 37.3%     SD: -6.7%
_____________________________________________________________

Thanks,

- KK


^ permalink raw reply	[flat|nested] 133+ messages in thread


* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-28 20:02                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 133+ messages in thread
From: Michael S. Tsirkin @ 2011-05-28 20:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Thu, May 26, 2011 at 12:58:23PM +0930, Rusty Russell wrote:
> On Wed, 25 May 2011 09:07:59 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, May 25, 2011 at 11:05:04AM +0930, Rusty Russell wrote:
> > Hmm I'm not sure I got it, need to think about this.
> > I'd like to go back and document how my design was supposed to work.
> > This really should have been in commit log or even a comment.
> > I thought we need a min, not a max.
> > We start with this:
> > 
> > 	while ((c = (virtqueue_get_capacity(vq) < 2 + MAX_SKB_FRAGS)) &&
> > 	       (skb = get_buf()))
> > 		kfree_skb(skb);
> > 	return !c;
> > 
> > This is clean and simple, right? And it's exactly asking for what we need.
> 
> No, I started from the other direction:
> 
>         for (i = 0; i < 2; i++) {
>                 skb = get_buf();
>                 if (!skb)
>                         break;
>                 kfree_skb(skb);
>         }
> 
> ie. free two packets for every one we're about to add.  For steady state
> that would work really well.

Sure, with indirect buffers, but if we
don't use indirect (and we discussed switching indirect off
dynamically in the past) this becomes harder to
be sure about. I think I understand why but
does not a simple capacity check make it more obvious?

>  Then we hit the case where the ring
> seems full after we do the add: at that point, screw latency, and just
> try to free all the buffers we can.

I see. But the code currently does this:

	for(..)
		get_buf
	add_buf
	if (capacity < max_sk_frags+2) {
		if (!enable_cb)
			for(..)
				get_buf
	}


In other words the second get_buf is only called
in the unlikely case of a race condition.

So we'll need to add *another* call to get_buf.
Is it just me or is this becoming messy?

I was also be worried that we are adding more
"modes" to the code: high and low latency
depending on different speeds between host and guest,
which would be hard to trigger and test.
That's why I tried hard to make the code behave the
same all the time and free up just a bit more than
the minimum necessary.

> > on the normal path min == 2 so we're low latency but we keep ahead on
> > average. min == 0 for the "we're out of capacity, we may have to stop
> > the queue".
> > 
> > Does the above make sense at all?
> 
> It makes sense, but I think it's a classic case where incremental
> improvements aren't as good as starting from scratch.
> 
> Cheers,
> Rusty.

The only difference on the good path seems to be an extra capacity check,
so I don't expect the difference will be testable, do you?

-- 
MST

^ permalink raw reply	[flat|nested] 133+ messages in thread


* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
@ 2011-05-30  6:27                     ` Rusty Russell
  0 siblings, 0 replies; 133+ messages in thread
From: Rusty Russell @ 2011-05-30  6:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

On Sat, 28 May 2011 23:02:04 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, May 26, 2011 at 12:58:23PM +0930, Rusty Russell wrote:
> > ie. free two packets for every one we're about to add.  For steady state
> > that would work really well.
> 
> Sure, with indirect buffers, but if we
> don't use indirect (and we discussed switching indirect off
> dynamically in the past) this becomes harder to
> be sure about. I think I understand why but
> does not a simple capacity check make it more obvious?

...

> >  Then we hit the case where the ring
> > seems full after we do the add: at that point, screw latency, and just
> > try to free all the buffers we can.
> 
> I see. But the code currently does this:
> 
> 	for(..)
> 		get_buf
> 	add_buf
> 	if (capacity < max_sk_frags+2) {
> 		if (!enable_cb)
> 			for(..)
> 				get_buf
> 	}
> 
> 
> In other words the second get_buf is only called
> in the unlikely case of race condition.
> 
> So we'll need to add *another* call to get_buf.
> Is it just me or is this becoming messy?

Yes, good point.  I really wonder if anyone would be able to measure the
difference between simply freeing 2 every time (with possible extra
stalls for strange cases) and the more complete version.

But it goes against my grain to implement heuristics when one more call
would make it provably reliable.

Please find a way to make that for loop less ugly though!
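
As an illustration only, reusing the hypothetical
free_old_xmit_skbs_min() sketch from the previous message (vi->svq
and xmit_skb() as in drivers/net/virtio_net.c; this is not the
actual patch), the one extra call could sit on the slow path alone:

	static netdev_tx_t start_xmit_sketch(struct sk_buff *skb,
					     struct net_device *dev)
	{
		struct virtnet_info *vi = netdev_priv(dev);

		/* Fast path: reclaim a couple of completed buffers
		 * per packet queued, keeping ahead on average. */
		free_old_xmit_skbs_min(vi->svq, 2);

		if (xmit_skb(vi, skb) < 0) {
			/* Ring looked full: this is the one extra
			 * reclaim call that makes the scheme provably
			 * reliable rather than heuristic. */
			if (!free_old_xmit_skbs_min(vi->svq, 0)) {
				netif_stop_queue(dev);
				return NETDEV_TX_BUSY;
			}
			xmit_skb(vi, skb);
		}
		virtqueue_kick(vi->svq);
		return NETDEV_TX_OK;
	}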

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 133+ messages in thread

