netdev.vger.kernel.org archive mirror
* [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
@ 2014-03-06 21:48 Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 1/9] xen-netback: Use skb->cb for pending_idx Zoltan Kiss
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, jonathan.davies, Zoltan Kiss

A long-known problem of the upstream netback implementation is that on the TX
path (from guest to Dom0) it copies the whole packet from guest memory into
Dom0. That simply became a bottleneck with 10Gb NICs, and generally it is a
huge performance penalty. The classic kernel version of netback used grant
mapping, and to get notified when a page could be unmapped it used page
destructors. Unfortunately that destructor is not an upstreamable solution.
Ian Campbell's skb fragment destructor patch series [1] tried to solve this
problem, but it turned out to be very invasive on the network stack's code,
and therefore hasn't progressed very well.

This patch series uses the SKBTX_DEV_ZEROCOPY flag to tell the stack that
netback needs to know when the skb is freed. That is the way KVM solved the
same problem, and based on my initial tests it can do the same for us.
Avoiding the extra copy boosted TX throughput from 6.8 Gbps to 7.9 Gbps
(measured on a slower AMD Interlagos box, both Dom0 and guest running an
upstream kernel on the same NUMA node, running iperf 2.0.5, with the remote
end a bare metal box on the same 10Gb switch).

Based on my investigation the packet only gets copied if it is delivered to
the Dom0 IP stack through deliver_skb, which is due to this [2] patch. This
affects DomU->Dom0 IP traffic and the case when Dom0 does routing/NAT for the
guest. That's a bit unfortunate, but luckily it doesn't cause a major
regression for this use case. In the future we should try to eliminate that
copy somehow.
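
For reference, here is a minimal sketch of the generic SKBTX_DEV_ZEROCOPY /
ubuf_info pattern (from the mainline skbuff API) that the series builds on.
This is an illustration only, not code from the series, and the function names
are made up:

#include <linux/skbuff.h>

/* Called by the stack once it no longer references the zerocopy pages */
static void my_zerocopy_done(struct ubuf_info *ubuf, bool zerocopy_success)
{
	/* now it is safe to unmap / release the pages behind the skb */
}

static void mark_skb_zerocopy(struct sk_buff *skb, struct ubuf_info *ubuf)
{
	ubuf->callback = my_zerocopy_done;
	ubuf->ctx = NULL;	/* driver-private chaining, if needed */
	skb_shinfo(skb)->destructor_arg = ubuf;
	skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
}
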
There are a few spinoff tasks which will be addressed in separate patches:
- grant copy the header directly instead of map and memcpy; this should help
  us avoid TLB flushing
- use something other than ballooned pages
- fix grant map to use page->index properly

I've tried to break the series down into smaller patches, with mixed results,
so I welcome suggestions on that part as well:
1: Use skb->cb to store pending_idx
2: Some refactoring
3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
review it after #4)
4: Introduce TX grant mapping
5: Remove old TX grant copy definitions and fix indentations
6: Add stat counters for zerocopy
7: Handle guests with too many frags
8: Timeout packets in RX path
9: Aggregate TX unmap operations

v2: I've fixed some smaller things, see the individual patches. I've added a
few new stat counters, and handling for the important use case when an older
guest sends lots of slots. Instead of delayed copy we now time out packets on
the RX path, based on the assumption that otherwise packets shouldn't get
stuck anywhere else. Finally, some unmap batching to avoid too many TLB
flushes.

v3: Apart from fixing a few things mentioned in the responses, the important
change is using the hypercall directly for grant [un]mapping, so we can avoid
the m2p override.

v4: Now we are using a new grant mapping API to avoid m2p_override. The RX
queue timeout logic has changed as well.

v5: Only minor fixes based on Wei's comments

v6: Important bugfixes for the xenvif_poll exit path and the zerocopy
callback, see the first 2 patches. Also reworked the handling of packets with
too many slots, and reordered the series a bit.

v7: Small fixes in comments/log messages/error paths, and merging the frag
overflow stats patch into its parent.

[1] http://lwn.net/Articles/491522/
[2] https://lkml.org/lkml/2012/7/20/363

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>


* [PATCH net-next v7 1/9] xen-netback: Use skb->cb for pending_idx
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 2/9] xen-netback: Minor refactoring of netback code Zoltan Kiss
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, Zoltan Kiss, jonathan.davies

Storing the pending_idx in the first byte of the linear buffer never looked
good; skb->cb is a more appropriate place for it. Keeping it in the linear
buffer also prevents the header from being grant copied there directly, and
the pending_idx would no longer be available once the header had been copied
there, so it's time to change it.
It also introduces helpers for the RX side.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
v7:
- using helpers to access skb->cb, on RX side too
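
The hunks below use the common control-buffer overlay idiom: define a small
private struct and cast skb->cb to it. A generic sketch of the idiom follows;
the names and the BUILD_BUG_ON size check are illustrative only, not part of
this patch:

#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/skbuff.h>

struct my_cb {				/* example private per-skb state */
	u16 pending_idx;
};
#define MY_CB(skb) ((struct my_cb *)(skb)->cb)

static void my_cb_store(struct sk_buff *skb, u16 pending_idx)
{
	/* skb->cb is only 48 bytes; an overlay must never exceed it */
	BUILD_BUG_ON(sizeof(struct my_cb) > FIELD_SIZEOF(struct sk_buff, cb));
	MY_CB(skb)->pending_idx = pending_idx;
}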

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index e5284bc..43ae4ba 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -455,10 +455,12 @@ static void xenvif_add_frag_responses(struct xenvif *vif, int status,
 	}
 }
 
-struct skb_cb_overlay {
+struct xenvif_rx_cb {
 	int meta_slots_used;
 };
 
+#define XENVIF_RX_CB(skb) ((struct xenvif_rx_cb *)(skb)->cb)
+
 void xenvif_kick_thread(struct xenvif *vif)
 {
 	wake_up(&vif->wq);
@@ -474,7 +476,6 @@ static void xenvif_rx_action(struct xenvif *vif)
 	LIST_HEAD(notify);
 	int ret;
 	unsigned long offset;
-	struct skb_cb_overlay *sco;
 	bool need_to_notify = false;
 
 	struct netrx_pending_operations npo = {
@@ -513,9 +514,8 @@ static void xenvif_rx_action(struct xenvif *vif)
 		} else
 			vif->rx_last_skb_slots = 0;
 
-		sco = (struct skb_cb_overlay *)skb->cb;
-		sco->meta_slots_used = xenvif_gop_skb(skb, &npo);
-		BUG_ON(sco->meta_slots_used > max_slots_needed);
+		XENVIF_RX_CB(skb)->meta_slots_used = xenvif_gop_skb(skb, &npo);
+		BUG_ON(XENVIF_RX_CB(skb)->meta_slots_used > max_slots_needed);
 
 		__skb_queue_tail(&rxq, skb);
 	}
@@ -529,7 +529,6 @@ static void xenvif_rx_action(struct xenvif *vif)
 	gnttab_batch_copy(vif->grant_copy_op, npo.copy_prod);
 
 	while ((skb = __skb_dequeue(&rxq)) != NULL) {
-		sco = (struct skb_cb_overlay *)skb->cb;
 
 		if ((1 << vif->meta[npo.meta_cons].gso_type) &
 		    vif->gso_prefix_mask) {
@@ -540,19 +539,21 @@ static void xenvif_rx_action(struct xenvif *vif)
 
 			resp->offset = vif->meta[npo.meta_cons].gso_size;
 			resp->id = vif->meta[npo.meta_cons].id;
-			resp->status = sco->meta_slots_used;
+			resp->status = XENVIF_RX_CB(skb)->meta_slots_used;
 
 			npo.meta_cons++;
-			sco->meta_slots_used--;
+			XENVIF_RX_CB(skb)->meta_slots_used--;
 		}
 
 
 		vif->dev->stats.tx_bytes += skb->len;
 		vif->dev->stats.tx_packets++;
 
-		status = xenvif_check_gop(vif, sco->meta_slots_used, &npo);
+		status = xenvif_check_gop(vif,
+					  XENVIF_RX_CB(skb)->meta_slots_used,
+					  &npo);
 
-		if (sco->meta_slots_used == 1)
+		if (XENVIF_RX_CB(skb)->meta_slots_used == 1)
 			flags = 0;
 		else
 			flags = XEN_NETRXF_more_data;
@@ -589,13 +590,13 @@ static void xenvif_rx_action(struct xenvif *vif)
 
 		xenvif_add_frag_responses(vif, status,
 					  vif->meta + npo.meta_cons + 1,
-					  sco->meta_slots_used);
+					  XENVIF_RX_CB(skb)->meta_slots_used);
 
 		RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&vif->rx, ret);
 
 		need_to_notify |= !!ret;
 
-		npo.meta_cons += sco->meta_slots_used;
+		npo.meta_cons += XENVIF_RX_CB(skb)->meta_slots_used;
 		dev_kfree_skb(skb);
 	}
 
@@ -772,6 +773,13 @@ static struct page *xenvif_alloc_page(struct xenvif *vif,
 	return page;
 }
 
+
+struct xenvif_tx_cb {
+	u16 pending_idx;
+};
+
+#define XENVIF_TX_CB(skb) ((struct xenvif_tx_cb *)(skb)->cb)
+
 static struct gnttab_copy *xenvif_get_requests(struct xenvif *vif,
 					       struct sk_buff *skb,
 					       struct xen_netif_tx_request *txp,
@@ -779,7 +787,7 @@ static struct gnttab_copy *xenvif_get_requests(struct xenvif *vif,
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	skb_frag_t *frags = shinfo->frags;
-	u16 pending_idx = *((u16 *)skb->data);
+	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 	u16 head_idx = 0;
 	int slot, start;
 	struct page *page;
@@ -897,7 +905,7 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 			       struct gnttab_copy **gopp)
 {
 	struct gnttab_copy *gop = *gopp;
-	u16 pending_idx = *((u16 *)skb->data);
+	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct pending_tx_info *tx_info;
 	int nr_frags = shinfo->nr_frags;
@@ -944,7 +952,7 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 			continue;
 
 		/* First error: invalidate header and preceding fragments. */
-		pending_idx = *((u16 *)skb->data);
+		pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
 		for (j = start; j < i; j++) {
 			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
@@ -1236,7 +1244,7 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 		memcpy(&vif->pending_tx_info[pending_idx].req,
 		       &txreq, sizeof(txreq));
 		vif->pending_tx_info[pending_idx].head = index;
-		*((u16 *)skb->data) = pending_idx;
+		XENVIF_TX_CB(skb)->pending_idx = pending_idx;
 
 		__skb_put(skb, data_len);
 
@@ -1283,7 +1291,7 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		u16 pending_idx;
 		unsigned data_len;
 
-		pending_idx = *((u16 *)skb->data);
+		pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 		txp = &vif->pending_tx_info[pending_idx].req;
 
 		/* Check the remap error code. */


* [PATCH net-next v7 2/9] xen-netback: Minor refactoring of netback code
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 1/9] xen-netback: Use skb->cb for pending_idx Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 3/9] xen-netback: Handle foreign mapped pages on the guest RX path Zoltan Kiss
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, jonathan.davies, Zoltan Kiss

This patch contains a few bits of refactoring before introducing the grant
mapping changes:
- introduce xenvif_tx_pending_slots_available(), as this check is used several
  times and will be used more often
- rename the thread to vifX.Y-guest-rx, to signify that it does RX work from
  the guest's point of view

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
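
For reference, a worked example (with hypothetical numbers) of the pending-slot
accounting that the new helper wraps; see the common.h hunk below for the real
definitions:

/* Suppose MAX_PENDING_REQS = 256, pending_prod = 300, pending_cons = 100.
 *
 *   free pending-ring entries = pending_prod - pending_cons       = 200
 *   slots currently in flight = MAX_PENDING_REQS - free entries   =  56
 *
 * xenvif_tx_pending_slots_available() then accepts another packet only if a
 * worst-case packet would still fit:
 *
 *   56 + XEN_NETBK_LEGACY_SLOTS_MAX < MAX_PENDING_REQS
 */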

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index ae413a2..9d35845 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -108,6 +108,15 @@ struct xenvif_rx_meta {
  */
 #define MAX_GRANT_COPY_OPS (MAX_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE)
 
+#define NETBACK_INVALID_HANDLE -1
+
+/* To avoid confusion, we define XEN_NETBK_LEGACY_SLOTS_MAX indicating
+ * the maximum slots a valid packet can use. Now this value is defined
+ * to be XEN_NETIF_NR_SLOTS_MIN, which is supposed to be supported by
+ * all backend.
+ */
+#define XEN_NETBK_LEGACY_SLOTS_MAX XEN_NETIF_NR_SLOTS_MIN
+
 struct xenvif {
 	/* Unique identifier for this interface. */
 	domid_t          domid;
@@ -216,7 +225,7 @@ void xenvif_carrier_off(struct xenvif *vif);
 
 int xenvif_tx_action(struct xenvif *vif, int budget);
 
-int xenvif_kthread(void *data);
+int xenvif_kthread_guest_rx(void *data);
 void xenvif_kick_thread(struct xenvif *vif);
 
 /* Determine whether the needed number of slots (req) are available,
@@ -226,6 +235,18 @@ bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed);
 
 void xenvif_stop_queue(struct xenvif *vif);
 
+static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
+{
+	return MAX_PENDING_REQS -
+		vif->pending_prod + vif->pending_cons;
+}
+
+static inline bool xenvif_tx_pending_slots_available(struct xenvif *vif)
+{
+	return nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
+		< MAX_PENDING_REQS;
+}
+
 extern bool separate_tx_rx_irq;
 
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 7669d49..bc32627 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -421,8 +421,8 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 		disable_irq(vif->rx_irq);
 	}
 
-	task = kthread_create(xenvif_kthread,
-			      (void *)vif, "%s", vif->dev->name);
+	task = kthread_create(xenvif_kthread_guest_rx,
+			      (void *)vif, "%s-guest-rx", vif->dev->name);
 	if (IS_ERR(task)) {
 		pr_warn("Could not allocate kthread for %s\n", vif->dev->name);
 		err = PTR_ERR(task);
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 43ae4ba..715d810 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -63,14 +63,6 @@ static unsigned int fatal_skb_slots = FATAL_SKB_SLOTS_DEFAULT;
 module_param(fatal_skb_slots, uint, 0444);
 
 /*
- * To avoid confusion, we define XEN_NETBK_LEGACY_SLOTS_MAX indicating
- * the maximum slots a valid packet can use. Now this value is defined
- * to be XEN_NETIF_NR_SLOTS_MIN, which is supposed to be supported by
- * all backend.
- */
-#define XEN_NETBK_LEGACY_SLOTS_MAX XEN_NETIF_NR_SLOTS_MIN
-
-/*
  * If head != INVALID_PENDING_RING_IDX, it means this tx request is head of
  * one or more merged tx requests, otherwise it is the continuation of
  * previous tx request.
@@ -131,12 +123,6 @@ static inline pending_ring_idx_t pending_index(unsigned i)
 	return i & (MAX_PENDING_REQS-1);
 }
 
-static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
-{
-	return MAX_PENDING_REQS -
-		vif->pending_prod + vif->pending_cons;
-}
-
 bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
 {
 	RING_IDX prod, cons;
@@ -1116,8 +1102,7 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 	struct sk_buff *skb;
 	int ret;
 
-	while ((nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
-		< MAX_PENDING_REQS) &&
+	while (xenvif_tx_pending_slots_available(vif) &&
 	       (skb_queue_len(&vif->tx_queue) < budget)) {
 		struct xen_netif_tx_request txreq;
 		struct xen_netif_tx_request txfrags[XEN_NETBK_LEGACY_SLOTS_MAX];
@@ -1487,8 +1472,7 @@ static inline int tx_work_todo(struct xenvif *vif)
 {
 
 	if (likely(RING_HAS_UNCONSUMED_REQUESTS(&vif->tx)) &&
-	    (nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
-	     < MAX_PENDING_REQS))
+	    xenvif_tx_pending_slots_available(vif))
 		return 1;
 
 	return 0;
@@ -1551,7 +1535,7 @@ static void xenvif_start_queue(struct xenvif *vif)
 		netif_wake_queue(vif->dev);
 }
 
-int xenvif_kthread(void *data)
+int xenvif_kthread_guest_rx(void *data)
 {
 	struct xenvif *vif = data;
 	struct sk_buff *skb;


* [PATCH net-next v7 3/9] xen-netback: Handle foreign mapped pages on the guest RX path
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 1/9] xen-netback: Use skb->cb for pending_idx Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 2/9] xen-netback: Minor refactoring of netback code Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping Zoltan Kiss
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, Zoltan Kiss, jonathan.davies

The RX path needs to know whether the skb fragments are stored on pages from
another domain.
Logically this patch should come after the one introducing the grant mapping
itself, as it only makes sense after that, but I moved it here to keep
bisectability. It shouldn't change any functionality yet:
xenvif_zerocopy_callback and ubuf_to_vif are just stubs here; they will be
implemented properly later in the series.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
v4:
- indentation fixes

v6:
- use ubuf_to_vif helper
- move the patch earlier in the series due to bisectability
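
Condensed from the netback.c hunk below, the essence of the change is how the
grant-copy source is specified (a sketch only; the helper function is made up
for illustration):

static void set_copy_source(struct gnttab_copy *copy_gop, struct page *page,
			    struct xenvif *foreign_vif, grant_ref_t foreign_gref)
{
	if (foreign_vif) {
		/* Foreign page: a guest page that netback only grant-mapped.
		 * Reuse the grant ref from the guest's TX request and let Xen
		 * resolve it for the copy.
		 */
		copy_gop->source.domid = foreign_vif->domid;
		copy_gop->source.u.ref = foreign_gref;
		copy_gop->flags |= GNTCOPY_source_gref;
	} else {
		/* Ordinary local (Dom0-owned) page */
		copy_gop->source.domid = DOMID_SELF;
		copy_gop->source.u.gmfn = virt_to_mfn(page_address(page));
	}
}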

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 9d35845..8f264df 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -247,6 +247,9 @@ static inline bool xenvif_tx_pending_slots_available(struct xenvif *vif)
 		< MAX_PENDING_REQS;
 }
 
+/* Callback from stack when TX packet can be released */
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success);
+
 extern bool separate_tx_rx_irq;
 
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 715d810..e9391ba 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -101,6 +101,10 @@ static inline unsigned long idx_to_kaddr(struct xenvif *vif,
 	return (unsigned long)pfn_to_kaddr(idx_to_pfn(vif, idx));
 }
 
+static inline struct xenvif* ubuf_to_vif(struct ubuf_info *ubuf)
+{
+	return NULL;
+}
 /* This is a miniumum size for the linear area to avoid lots of
  * calls to __pskb_pull_tail() as we set up checksum offsets. The
  * value 128 was chosen as it covers all IPv4 and most likely
@@ -221,7 +225,9 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
 static void xenvif_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 				 struct netrx_pending_operations *npo,
 				 struct page *page, unsigned long size,
-				 unsigned long offset, int *head)
+				 unsigned long offset, int *head,
+				 struct xenvif *foreign_vif,
+				 grant_ref_t foreign_gref)
 {
 	struct gnttab_copy *copy_gop;
 	struct xenvif_rx_meta *meta;
@@ -263,8 +269,15 @@ static void xenvif_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 		copy_gop->flags = GNTCOPY_dest_gref;
 		copy_gop->len = bytes;
 
-		copy_gop->source.domid = DOMID_SELF;
-		copy_gop->source.u.gmfn = virt_to_mfn(page_address(page));
+		if (foreign_vif) {
+			copy_gop->source.domid = foreign_vif->domid;
+			copy_gop->source.u.ref = foreign_gref;
+			copy_gop->flags |= GNTCOPY_source_gref;
+		} else {
+			copy_gop->source.domid = DOMID_SELF;
+			copy_gop->source.u.gmfn =
+				virt_to_mfn(page_address(page));
+		}
 		copy_gop->source.offset = offset;
 
 		copy_gop->dest.domid = vif->domid;
@@ -325,6 +338,9 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 	int old_meta_prod;
 	int gso_type;
 	int gso_size;
+	struct ubuf_info *ubuf = skb_shinfo(skb)->destructor_arg;
+	grant_ref_t foreign_grefs[MAX_SKB_FRAGS];
+	struct xenvif *foreign_vif = NULL;
 
 	old_meta_prod = npo->meta_prod;
 
@@ -365,6 +381,19 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 	npo->copy_off = 0;
 	npo->copy_gref = req->gref;
 
+	if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) &&
+		 (ubuf->callback == &xenvif_zerocopy_callback)) {
+		int i = 0;
+		foreign_vif = ubuf_to_vif(ubuf);
+
+		do {
+			u16 pending_idx = ubuf->desc;
+			foreign_grefs[i++] =
+				foreign_vif->pending_tx_info[pending_idx].req.gref;
+			ubuf = (struct ubuf_info *) ubuf->ctx;
+		} while (ubuf);
+	}
+
 	data = skb->data;
 	while (data < skb_tail_pointer(skb)) {
 		unsigned int offset = offset_in_page(data);
@@ -374,7 +403,9 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 			len = skb_tail_pointer(skb) - data;
 
 		xenvif_gop_frag_copy(vif, skb, npo,
-				     virt_to_page(data), len, offset, &head);
+				     virt_to_page(data), len, offset, &head,
+				     NULL,
+				     0);
 		data += len;
 	}
 
@@ -383,7 +414,9 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 				     skb_frag_page(&skb_shinfo(skb)->frags[i]),
 				     skb_frag_size(&skb_shinfo(skb)->frags[i]),
 				     skb_shinfo(skb)->frags[i].page_offset,
-				     &head);
+				     &head,
+				     foreign_vif,
+				     foreign_grefs[i]);
 	}
 
 	return npo->meta_prod - old_meta_prod;
@@ -1351,6 +1384,11 @@ static int xenvif_tx_submit(struct xenvif *vif)
 	return work_done;
 }
 
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success)
+{
+	return;
+}
+
 /* Called after netfront has transmitted */
 int xenvif_tx_action(struct xenvif *vif, int budget)
 {


* [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (2 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 3/9] xen-netback: Handle foreign mapped pages on the guest RX path Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-13 10:17   ` Ian Campbell
  2014-03-13 10:33   ` Ian Campbell
  2014-03-06 21:48 ` [PATCH net-next v7 5/9] xen-netback: Remove old TX grant copy definitions and fix indentations Zoltan Kiss
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, jonathan.davies, Zoltan Kiss

This patch introduces grant mapping on the netback TX path. It replaces the
grant copy operations, ditching grant copy coalescing along the way. Another
solution for copy coalescing is introduced in "xen-netback: Handle guests with
too many frags"; older guests and Windows can break before that patch applies.
There is a callback (xenvif_zerocopy_callback) from the core stack to release
the slots back to the guests when kfree_skb or skb_orphan_frags is called. It
feeds a separate dealloc thread, as scheduling a NAPI instance from there is
inefficient, so we can't do the dealloc from the NAPI instance itself.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
History of the original #1 patch (xen-netback: Introduce TX grant map
definitions)
v2:
- move unmapping to separate thread. The NAPI instance has to be scheduled
  even from thread context, which can cause huge delays
- that unfortunately results in a bigger struct xenvif
- store grant handle after checking validity

v3:
- fix comment in xenvif_tx_dealloc_action()
- call unmap hypercall directly instead of gnttab_unmap_refs(), which does
  unnecessary m2p_override. Also remove pages_to_[un]map members
- BUG() if grant_tx_handle corrupted

v4:
- fix indentations and comments
- use bool for tx_dealloc_work_todo
- BUG() if grant_tx_handle corrupted - now really :)
- go back to gnttab_unmap_refs, now we rely on API changes

v5:
- remove hypercall from xenvif_idx_unmap
- remove stray line in xenvif_tx_create_gop
- simplify tx_dealloc_work_todo
- BUG() in xenvif_idx_unmap

History of the original #2 patch (xen-netback: Change TX path from grant copy to
mapping):
v2:
- delete the branch for handling fragmented packets that fit into a
  PKT_PROT_LEN sized first request
- mark the effect of using ballooned pages in a comment
- place setting of skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY right
  before netif_receive_skb, and mark the importance of it
- grab dealloc_lock before __napi_complete to avoid contention with the
  callback's napi_schedule
- handle fragmented packets where first request < PKT_PROT_LEN
- fix up error path when checksum_setup failed
- check for pending grants before teardown, and start complaining if they are
  still there after 10 seconds

v3:
- delete a surplus check from tx_action
- remove stray line
- squash xenvif_idx_unmap changes into the first patch
- init spinlocks
- call map hypercall directly instead of gnttab_map_refs()
- fix unmapping timeout in xenvif_free()

v4:
- fix indentations and comments
- handle errors of set_phys_to_machine
- go back to gnttab_map_refs instead of direct hypercall. Now we rely on the
  modified API

v5:
- BUG_ON(vif->dealloc_task) in xenvif_connect
- use 'task' in xenvif_connect for thread creation
- proper return value if alloc_xenballooned_pages fails
- BUG in xenvif_tx_check_gop if stale handle found

Joint history:

v6:
- I wanted to avoid napi_schedule in zerocopy_callback, as it has a performance
  penalty if called from another vif thread's context. That's why I moved
  dealloc to a separate thread. But there are certain situations where it's
  necessary, so do it there. Fortunately it doesn't happen too often; I
  couldn't see a performance regression in iperf tests
- tx_dealloc_work_todo was missing a ';'
- move some definitions to common.h due to the previous change
- new ubuf_to_vif and xenvif_grant_handle_[re]set helpers
- call xenvif_idx_release from xenvif_idx_unmap instead of right after it
- improve comments
- add more BUG_ONs
- check xenvif_tx_pending_slots_available before napi_complete, otherwise NAPI
  will keep polling on that instance even though it can't do anything, as there
  is no room for pending requests. It becomes even worse if the pending packets
  are waiting for another instance on the same CPU.

v7:
- fixing comments, error messages, indentations
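
Before the large diff, a condensed sketch of the zerocopy-callback ->
dealloc-thread handoff described above (simplified from the hunks below;
locking, batching limits and error handling are omitted):

/* Producer side: xenvif_zerocopy_callback(), run by the stack when an skb
 * that still references granted pages is freed; it only queues pending_idx.
 */
	index = pending_index(vif->dealloc_prod);
	vif->dealloc_ring[index] = pending_idx;
	smp_wmb();			/* entry visible before producer bump */
	vif->dealloc_prod++;
	wake_up(&vif->dealloc_wq);	/* kick the per-vif dealloc thread */

/* Consumer side: xenvif_dealloc_kthread() -> xenvif_tx_dealloc_action()
 * drains the ring, batches the grant unmaps via gnttab_set_unmap_op() /
 * gnttab_unmap_refs(), then calls xenvif_idx_release() to put the slots back
 * on the pending ring and send the TX responses to the guest.
 */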

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 8f264df..5a99126 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -79,6 +79,17 @@ struct pending_tx_info {
 				  * if it is head of one or more tx
 				  * reqs
 				  */
+	/* Callback data for released SKBs. The callback is always
+	 * xenvif_zerocopy_callback, desc contains the pending_idx, which is
+	 * also an index in pending_tx_info array. It is initialized in
+	 * xenvif_alloc and it never changes.
+	 * skb_shinfo(skb)->destructor_arg points to the first mapped slot's
+	 * callback_struct in this array of struct pending_tx_info's, then ctx
+	 * to the next, or NULL if there is no more slot for this skb.
+	 * ubuf_to_vif is a helper which finds the struct xenvif from a pointer
+	 * to this field.
+	 */
+	struct ubuf_info callback_struct;
 };
 
 #define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
@@ -135,13 +146,31 @@ struct xenvif {
 	pending_ring_idx_t pending_cons;
 	u16 pending_ring[MAX_PENDING_REQS];
 	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
+	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
 
 	/* Coalescing tx requests before copying makes number of grant
 	 * copy ops greater or equal to number of slots required. In
 	 * worst case a tx request consumes 2 gnttab_copy.
 	 */
 	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
-
+	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
+	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
+	/* passed to gnttab_[un]map_refs with pages under (un)mapping */
+	struct page *pages_to_map[MAX_PENDING_REQS];
+	struct page *pages_to_unmap[MAX_PENDING_REQS];
+
+	/* This prevents zerocopy callbacks  to race over dealloc_ring */
+	spinlock_t callback_lock;
+	/* This prevents dealloc thread and NAPI instance to race over response
+	 * creation and pending_ring in xenvif_idx_release. In xenvif_tx_err
+	 * it only protect response creation
+	 */
+	spinlock_t response_lock;
+	pending_ring_idx_t dealloc_prod;
+	pending_ring_idx_t dealloc_cons;
+	u16 dealloc_ring[MAX_PENDING_REQS];
+	struct task_struct *dealloc_task;
+	wait_queue_head_t dealloc_wq;
 
 	/* Use kthread for guest RX */
 	struct task_struct *task;
@@ -228,6 +257,8 @@ int xenvif_tx_action(struct xenvif *vif, int budget);
 int xenvif_kthread_guest_rx(void *data);
 void xenvif_kick_thread(struct xenvif *vif);
 
+int xenvif_dealloc_kthread(void *data);
+
 /* Determine whether the needed number of slots (req) are available,
  * and set req_event if not.
  */
@@ -235,6 +266,12 @@ bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed);
 
 void xenvif_stop_queue(struct xenvif *vif);
 
+/* Callback from stack when TX packet can be released */
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success);
+
+/* Unmap a pending page and release it back to the guest */
+void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx);
+
 static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
 {
 	return MAX_PENDING_REQS -
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index bc32627..1fe9fe5 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -38,6 +38,7 @@
 
 #include <xen/events.h>
 #include <asm/xen/hypercall.h>
+#include <xen/balloon.h>
 
 #define XENVIF_QUEUE_LENGTH 32
 #define XENVIF_NAPI_WEIGHT  64
@@ -87,7 +88,8 @@ static int xenvif_poll(struct napi_struct *napi, int budget)
 		local_irq_save(flags);
 
 		RING_FINAL_CHECK_FOR_REQUESTS(&vif->tx, more_to_do);
-		if (!more_to_do)
+		if (!(more_to_do &&
+		      xenvif_tx_pending_slots_available(vif)))
 			__napi_complete(napi);
 
 		local_irq_restore(flags);
@@ -121,7 +123,9 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	BUG_ON(skb->dev != dev);
 
 	/* Drop the packet if vif is not ready */
-	if (vif->task == NULL || !xenvif_schedulable(vif))
+	if (vif->task == NULL ||
+	    vif->dealloc_task == NULL ||
+	    !xenvif_schedulable(vif))
 		goto drop;
 
 	/* At best we'll need one slot for the header and one for each
@@ -343,8 +347,26 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->pending_prod = MAX_PENDING_REQS;
 	for (i = 0; i < MAX_PENDING_REQS; i++)
 		vif->pending_ring[i] = i;
-	for (i = 0; i < MAX_PENDING_REQS; i++)
-		vif->mmap_pages[i] = NULL;
+	spin_lock_init(&vif->callback_lock);
+	spin_lock_init(&vif->response_lock);
+	/* If ballooning is disabled, this will consume real memory, so you
+	 * better enable it. The long term solution would be to use just a
+	 * bunch of valid page descriptors, without dependency on ballooning
+	 */
+	err = alloc_xenballooned_pages(MAX_PENDING_REQS,
+				       vif->mmap_pages,
+				       false);
+	if (err) {
+		netdev_err(dev, "Could not reserve mmap_pages\n");
+		return ERR_PTR(-ENOMEM);
+	}
+	for (i = 0; i < MAX_PENDING_REQS; i++) {
+		vif->pending_tx_info[i].callback_struct = (struct ubuf_info)
+			{ .callback = xenvif_zerocopy_callback,
+			  .ctx = NULL,
+			  .desc = i };
+		vif->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
+	}
 
 	/*
 	 * Initialise a dummy MAC address. We choose the numerically
@@ -382,12 +404,14 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 
 	BUG_ON(vif->tx_irq);
 	BUG_ON(vif->task);
+	BUG_ON(vif->dealloc_task);
 
 	err = xenvif_map_frontend_rings(vif, tx_ring_ref, rx_ring_ref);
 	if (err < 0)
 		goto err;
 
 	init_waitqueue_head(&vif->wq);
+	init_waitqueue_head(&vif->dealloc_wq);
 
 	if (tx_evtchn == rx_evtchn) {
 		/* feature-split-event-channels == 0 */
@@ -431,6 +455,16 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 
 	vif->task = task;
 
+	task = kthread_create(xenvif_dealloc_kthread,
+			      (void *)vif, "%s-dealloc", vif->dev->name);
+	if (IS_ERR(task)) {
+		pr_warn("Could not allocate kthread for %s\n", vif->dev->name);
+		err = PTR_ERR(task);
+		goto err_rx_unbind;
+	}
+
+	vif->dealloc_task = task;
+
 	rtnl_lock();
 	if (!vif->can_sg && vif->dev->mtu > ETH_DATA_LEN)
 		dev_set_mtu(vif->dev, ETH_DATA_LEN);
@@ -441,6 +475,7 @@ int xenvif_connect(struct xenvif *vif, unsigned long tx_ring_ref,
 	rtnl_unlock();
 
 	wake_up_process(vif->task);
+	wake_up_process(vif->dealloc_task);
 
 	return 0;
 
@@ -478,6 +513,11 @@ void xenvif_disconnect(struct xenvif *vif)
 		vif->task = NULL;
 	}
 
+	if (vif->dealloc_task) {
+		kthread_stop(vif->dealloc_task);
+		vif->dealloc_task = NULL;
+	}
+
 	if (vif->tx_irq) {
 		if (vif->tx_irq == vif->rx_irq)
 			unbind_from_irqhandler(vif->tx_irq, vif);
@@ -493,6 +533,23 @@ void xenvif_disconnect(struct xenvif *vif)
 
 void xenvif_free(struct xenvif *vif)
 {
+	int i, unmap_timeout = 0;
+
+	for (i = 0; i < MAX_PENDING_REQS; ++i) {
+		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
+			unmap_timeout++;
+			schedule_timeout(msecs_to_jiffies(1000));
+			if (unmap_timeout > 9 &&
+			    net_ratelimit())
+				netdev_err(vif->dev,
+					   "Page still granted! Index: %x\n",
+					   i);
+			i = -1;
+		}
+	}
+
+	free_xenballooned_pages(MAX_PENDING_REQS, vif->mmap_pages);
+
 	netif_napi_del(&vif->napi);
 
 	unregister_netdev(vif->dev);
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index e9391ba..cb29134 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -101,10 +101,18 @@ static inline unsigned long idx_to_kaddr(struct xenvif *vif,
 	return (unsigned long)pfn_to_kaddr(idx_to_pfn(vif, idx));
 }
 
+/* Find the containing VIF's structure from a pointer in pending_tx_info array
+ */
 static inline struct xenvif* ubuf_to_vif(struct ubuf_info *ubuf)
 {
-	return NULL;
+	u16 pending_idx = ubuf->desc;
+	struct pending_tx_info *temp =
+		container_of(ubuf, struct pending_tx_info, callback_struct);
+	return container_of(temp - pending_idx,
+			    struct xenvif,
+			    pending_tx_info[0]);
 }
+
 /* This is a miniumum size for the linear area to avoid lots of
  * calls to __pskb_pull_tail() as we set up checksum offsets. The
  * value 128 was chosen as it covers all IPv4 and most likely
@@ -665,9 +673,12 @@ static void xenvif_tx_err(struct xenvif *vif,
 			  struct xen_netif_tx_request *txp, RING_IDX end)
 {
 	RING_IDX cons = vif->tx.req_cons;
+	unsigned long flags;
 
 	do {
+		spin_lock_irqsave(&vif->response_lock, flags);
 		make_tx_response(vif, txp, XEN_NETIF_RSP_ERROR);
+		spin_unlock_irqrestore(&vif->response_lock, flags);
 		if (cons == end)
 			break;
 		txp = RING_GET_REQUEST(&vif->tx, cons++);
@@ -799,10 +810,24 @@ struct xenvif_tx_cb {
 
 #define XENVIF_TX_CB(skb) ((struct xenvif_tx_cb *)(skb)->cb)
 
-static struct gnttab_copy *xenvif_get_requests(struct xenvif *vif,
-					       struct sk_buff *skb,
-					       struct xen_netif_tx_request *txp,
-					       struct gnttab_copy *gop)
+static inline void xenvif_tx_create_gop(struct xenvif *vif,
+					u16 pending_idx,
+					struct xen_netif_tx_request *txp,
+					struct gnttab_map_grant_ref *gop)
+{
+	vif->pages_to_map[gop-vif->tx_map_ops] = vif->mmap_pages[pending_idx];
+	gnttab_set_map_op(gop, idx_to_kaddr(vif, pending_idx),
+			  GNTMAP_host_map | GNTMAP_readonly,
+			  txp->gref, vif->domid);
+
+	memcpy(&vif->pending_tx_info[pending_idx].req, txp,
+	       sizeof(*txp));
+}
+
+static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
+							struct sk_buff *skb,
+							struct xen_netif_tx_request *txp,
+							struct gnttab_map_grant_ref *gop)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	skb_frag_t *frags = shinfo->frags;
@@ -823,83 +848,12 @@ static struct gnttab_copy *xenvif_get_requests(struct xenvif *vif,
 	/* Skip first skb fragment if it is on same page as header fragment. */
 	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
 
-	/* Coalesce tx requests, at this point the packet passed in
-	 * should be <= 64K. Any packets larger than 64K have been
-	 * handled in xenvif_count_requests().
-	 */
-	for (shinfo->nr_frags = slot = start; slot < nr_slots;
-	     shinfo->nr_frags++) {
-		struct pending_tx_info *pending_tx_info =
-			vif->pending_tx_info;
-
-		page = alloc_page(GFP_ATOMIC|__GFP_COLD);
-		if (!page)
-			goto err;
-
-		dst_offset = 0;
-		first = NULL;
-		while (dst_offset < PAGE_SIZE && slot < nr_slots) {
-			gop->flags = GNTCOPY_source_gref;
-
-			gop->source.u.ref = txp->gref;
-			gop->source.domid = vif->domid;
-			gop->source.offset = txp->offset;
-
-			gop->dest.domid = DOMID_SELF;
-
-			gop->dest.offset = dst_offset;
-			gop->dest.u.gmfn = virt_to_mfn(page_address(page));
-
-			if (dst_offset + txp->size > PAGE_SIZE) {
-				/* This page can only merge a portion
-				 * of tx request. Do not increment any
-				 * pointer / counter here. The txp
-				 * will be dealt with in future
-				 * rounds, eventually hitting the
-				 * `else` branch.
-				 */
-				gop->len = PAGE_SIZE - dst_offset;
-				txp->offset += gop->len;
-				txp->size -= gop->len;
-				dst_offset += gop->len; /* quit loop */
-			} else {
-				/* This tx request can be merged in the page */
-				gop->len = txp->size;
-				dst_offset += gop->len;
-
+	for (shinfo->nr_frags = start; shinfo->nr_frags < nr_slots;
+	     shinfo->nr_frags++, txp++, gop++) {
 				index = pending_index(vif->pending_cons++);
-
 				pending_idx = vif->pending_ring[index];
-
-				memcpy(&pending_tx_info[pending_idx].req, txp,
-				       sizeof(*txp));
-
-				/* Poison these fields, corresponding
-				 * fields for head tx req will be set
-				 * to correct values after the loop.
-				 */
-				vif->mmap_pages[pending_idx] = (void *)(~0UL);
-				pending_tx_info[pending_idx].head =
-					INVALID_PENDING_RING_IDX;
-
-				if (!first) {
-					first = &pending_tx_info[pending_idx];
-					start_idx = index;
-					head_idx = pending_idx;
-				}
-
-				txp++;
-				slot++;
-			}
-
-			gop++;
-		}
-
-		first->req.offset = 0;
-		first->req.size = dst_offset;
-		first->head = start_idx;
-		vif->mmap_pages[head_idx] = page;
-		frag_set_pending_idx(&frags[shinfo->nr_frags], head_idx);
+		xenvif_tx_create_gop(vif, pending_idx, txp, gop);
+		frag_set_pending_idx(&frags[shinfo->nr_frags], pending_idx);
 	}
 
 	BUG_ON(shinfo->nr_frags > MAX_SKB_FRAGS);
@@ -919,11 +873,38 @@ err:
 	return NULL;
 }
 
+static inline void xenvif_grant_handle_set(struct xenvif *vif,
+					   u16 pending_idx,
+					   grant_handle_t handle)
+{
+	if (unlikely(vif->grant_tx_handle[pending_idx] !=
+		     NETBACK_INVALID_HANDLE)) {
+		netdev_err(vif->dev,
+			   "Trying to overwrite active handle! pending_idx: %x\n",
+			   pending_idx);
+		BUG();
+	}
+	vif->grant_tx_handle[pending_idx] = handle;
+}
+
+static inline void xenvif_grant_handle_reset(struct xenvif *vif,
+					     u16 pending_idx)
+{
+	if (unlikely(vif->grant_tx_handle[pending_idx] ==
+		     NETBACK_INVALID_HANDLE)) {
+		netdev_err(vif->dev,
+			   "Trying to unmap invalid handle! pending_idx: %x\n",
+			   pending_idx);
+		BUG();
+	}
+	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
+}
+
 static int xenvif_tx_check_gop(struct xenvif *vif,
 			       struct sk_buff *skb,
-			       struct gnttab_copy **gopp)
+			       struct gnttab_map_grant_ref **gopp)
 {
-	struct gnttab_copy *gop = *gopp;
+	struct gnttab_map_grant_ref *gop = *gopp;
 	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct pending_tx_info *tx_info;
@@ -935,6 +916,8 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 	err = gop->status;
 	if (unlikely(err))
 		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_ERROR);
+	else
+		xenvif_grant_handle_set(vif, pending_idx , gop->handle);
 
 	/* Skip first skb fragment if it is on same page as header fragment. */
 	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
@@ -948,18 +931,13 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 		head = tx_info->head;
 
 		/* Check error status: if okay then remember grant handle. */
-		do {
 			newerr = (++gop)->status;
-			if (newerr)
-				break;
-			peek = vif->pending_ring[pending_index(++head)];
-		} while (!pending_tx_is_head(vif, peek));
 
 		if (likely(!newerr)) {
+			xenvif_grant_handle_set(vif, pending_idx , gop->handle);
 			/* Had a previous error? Invalidate this fragment. */
 			if (unlikely(err))
-				xenvif_idx_release(vif, pending_idx,
-						   XEN_NETIF_RSP_OKAY);
+				xenvif_idx_unmap(vif, pending_idx);
 			continue;
 		}
 
@@ -972,11 +950,10 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 
 		/* First error: invalidate header and preceding fragments. */
 		pending_idx = XENVIF_TX_CB(skb)->pending_idx;
-		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
+		xenvif_idx_unmap(vif, pending_idx);
 		for (j = start; j < i; j++) {
 			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
-			xenvif_idx_release(vif, pending_idx,
-					   XEN_NETIF_RSP_OKAY);
+			xenvif_idx_unmap(vif, pending_idx);
 		}
 
 		/* Remember the error: invalidate all subsequent fragments. */
@@ -992,6 +969,10 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	int nr_frags = shinfo->nr_frags;
 	int i;
+	u16 prev_pending_idx = INVALID_PENDING_IDX;
+
+	if (skb_shinfo(skb)->destructor_arg)
+		prev_pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 
 	for (i = 0; i < nr_frags; i++) {
 		skb_frag_t *frag = shinfo->frags + i;
@@ -1001,6 +982,17 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
 
 		pending_idx = frag_get_pending_idx(frag);
 
+		/* If this is not the first frag, chain it to the previous*/
+		if (unlikely(prev_pending_idx == INVALID_PENDING_IDX))
+			skb_shinfo(skb)->destructor_arg =
+				&vif->pending_tx_info[pending_idx].callback_struct;
+		else if (likely(pending_idx != prev_pending_idx))
+			vif->pending_tx_info[prev_pending_idx].callback_struct.ctx =
+				&(vif->pending_tx_info[pending_idx].callback_struct);
+
+		vif->pending_tx_info[pending_idx].callback_struct.ctx = NULL;
+		prev_pending_idx = pending_idx;
+
 		txp = &vif->pending_tx_info[pending_idx].req;
 		page = virt_to_page(idx_to_kaddr(vif, pending_idx));
 		__skb_fill_page_desc(skb, i, page, txp->offset, txp->size);
@@ -1008,10 +1000,15 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
 		skb->data_len += txp->size;
 		skb->truesize += txp->size;
 
-		/* Take an extra reference to offset xenvif_idx_release */
+		/* Take an extra reference to offset network stack's put_page */
 		get_page(vif->mmap_pages[pending_idx]);
-		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
 	}
+	/* FIXME: __skb_fill_page_desc set this to true because page->pfmemalloc
+	 * overlaps with "index", and "mapping" is not set. I think mapping
+	 * should be set. If delivered to local stack, it would drop this
+	 * skb in sk_filter unless the socket has the right to use it.
+	 */
+	skb->pfmemalloc	= false;
 }
 
 static int xenvif_get_extras(struct xenvif *vif,
@@ -1131,7 +1128,7 @@ static bool tx_credit_exceeded(struct xenvif *vif, unsigned size)
 
 static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 {
-	struct gnttab_copy *gop = vif->tx_copy_ops, *request_gop;
+	struct gnttab_map_grant_ref *gop = vif->tx_map_ops, *request_gop;
 	struct sk_buff *skb;
 	int ret;
 
@@ -1238,30 +1235,10 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 			}
 		}
 
-		/* XXX could copy straight to head */
-		page = xenvif_alloc_page(vif, pending_idx);
-		if (!page) {
-			kfree_skb(skb);
-			xenvif_tx_err(vif, &txreq, idx);
-			break;
-		}
-
-		gop->source.u.ref = txreq.gref;
-		gop->source.domid = vif->domid;
-		gop->source.offset = txreq.offset;
-
-		gop->dest.u.gmfn = virt_to_mfn(page_address(page));
-		gop->dest.domid = DOMID_SELF;
-		gop->dest.offset = txreq.offset;
-
-		gop->len = txreq.size;
-		gop->flags = GNTCOPY_source_gref;
+		xenvif_tx_create_gop(vif, pending_idx, &txreq, gop);
 
 		gop++;
 
-		memcpy(&vif->pending_tx_info[pending_idx].req,
-		       &txreq, sizeof(txreq));
-		vif->pending_tx_info[pending_idx].head = index;
 		XENVIF_TX_CB(skb)->pending_idx = pending_idx;
 
 		__skb_put(skb, data_len);
@@ -1290,17 +1267,17 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 
 		vif->tx.req_cons = idx;
 
-		if ((gop-vif->tx_copy_ops) >= ARRAY_SIZE(vif->tx_copy_ops))
+		if ((gop-vif->tx_map_ops) >= ARRAY_SIZE(vif->tx_map_ops))
 			break;
 	}
 
-	return gop - vif->tx_copy_ops;
+	return gop - vif->tx_map_ops;
 }
 
 
 static int xenvif_tx_submit(struct xenvif *vif)
 {
-	struct gnttab_copy *gop = vif->tx_copy_ops;
+	struct gnttab_map_grant_ref *gop = vif->tx_map_ops;
 	struct sk_buff *skb;
 	int work_done = 0;
 
@@ -1324,14 +1301,17 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		memcpy(skb->data,
 		       (void *)(idx_to_kaddr(vif, pending_idx)|txp->offset),
 		       data_len);
+		vif->pending_tx_info[pending_idx].callback_struct.ctx = NULL;
 		if (data_len < txp->size) {
 			/* Append the packet payload as a fragment. */
 			txp->offset += data_len;
 			txp->size -= data_len;
+			skb_shinfo(skb)->destructor_arg =
+				&vif->pending_tx_info[pending_idx].callback_struct;
 		} else {
 			/* Schedule a response immediately. */
-			xenvif_idx_release(vif, pending_idx,
-					   XEN_NETIF_RSP_OKAY);
+			skb_shinfo(skb)->destructor_arg = NULL;
+			xenvif_idx_unmap(vif, pending_idx);
 		}
 
 		if (txp->flags & XEN_NETTXF_csum_blank)
@@ -1353,6 +1333,9 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		if (checksum_setup(vif, skb)) {
 			netdev_dbg(vif->dev,
 				   "Can't setup checksum in net_tx_action\n");
+			/* We have to set this flag to trigger the callback */
+			if (skb_shinfo(skb)->destructor_arg)
+				skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
 			kfree_skb(skb);
 			continue;
 		}
@@ -1378,6 +1361,14 @@ static int xenvif_tx_submit(struct xenvif *vif)
 
 		work_done++;
 
+		/* Set this flag right before netif_receive_skb, otherwise
+		 * someone might think this packet already left netback, and
+		 * do a skb_copy_ubufs while we are still in control of the
+		 * skb. E.g. the __pskb_pull_tail earlier can do such thing.
+		 */
+		if (skb_shinfo(skb)->destructor_arg)
+			skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+
 		netif_receive_skb(skb);
 	}
 
@@ -1386,14 +1377,111 @@ static int xenvif_tx_submit(struct xenvif *vif)
 
 void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success)
 {
-	return;
+	unsigned long flags;
+	pending_ring_idx_t index;
+	struct xenvif *vif = ubuf_to_vif(ubuf);
+
+	/* This is the only place where we grab this lock, to protect callbacks
+	 * from each other.
+	 */
+	spin_lock_irqsave(&vif->callback_lock, flags);
+	do {
+		u16 pending_idx = ubuf->desc;
+		ubuf = (struct ubuf_info *) ubuf->ctx;
+		BUG_ON(vif->dealloc_prod - vif->dealloc_cons >=
+			MAX_PENDING_REQS);
+		index = pending_index(vif->dealloc_prod);
+		vif->dealloc_ring[index] = pending_idx;
+		/* Sync with xenvif_tx_dealloc_action:
+		 * insert idx then incr producer.
+		 */
+		smp_wmb();
+		vif->dealloc_prod++;
+	} while (ubuf);
+	wake_up(&vif->dealloc_wq);
+	spin_unlock_irqrestore(&vif->callback_lock, flags);
+
+	if (RING_HAS_UNCONSUMED_REQUESTS(&vif->tx) &&
+	    xenvif_tx_pending_slots_available(vif)) {
+		local_bh_disable();
+		napi_schedule(&vif->napi);
+		local_bh_enable();
+	}
+}
+
+static inline void xenvif_tx_dealloc_action(struct xenvif *vif)
+{
+	struct gnttab_unmap_grant_ref *gop;
+	pending_ring_idx_t dc, dp;
+	u16 pending_idx, pending_idx_release[MAX_PENDING_REQS];
+	unsigned int i = 0;
+
+	dc = vif->dealloc_cons;
+	gop = vif->tx_unmap_ops;
+
+	/* Free up any grants we have finished using */
+	do {
+		dp = vif->dealloc_prod;
+
+		/* Ensure we see all indices enqueued by all
+		 * xenvif_zerocopy_callback().
+		 */
+		smp_rmb();
+
+		while (dc != dp) {
+			BUG_ON(gop - vif->tx_unmap_ops > MAX_PENDING_REQS);
+			pending_idx =
+				vif->dealloc_ring[pending_index(dc++)];
+
+			pending_idx_release[gop-vif->tx_unmap_ops] =
+				pending_idx;
+			vif->pages_to_unmap[gop-vif->tx_unmap_ops] =
+				vif->mmap_pages[pending_idx];
+			gnttab_set_unmap_op(gop,
+					    idx_to_kaddr(vif, pending_idx),
+					    GNTMAP_host_map,
+					    vif->grant_tx_handle[pending_idx]);
+			/* Btw. already unmapped? */
+			xenvif_grant_handle_reset(vif, pending_idx);
+			++gop;
+		}
+
+	} while (dp != vif->dealloc_prod);
+
+	vif->dealloc_cons = dc;
+
+	if (gop - vif->tx_unmap_ops > 0) {
+		int ret;
+		ret = gnttab_unmap_refs(vif->tx_unmap_ops,
+					NULL,
+					vif->pages_to_unmap,
+					gop - vif->tx_unmap_ops);
+		if (ret) {
+			netdev_err(vif->dev, "Unmap fail: nr_ops %x ret %d\n",
+				   gop - vif->tx_unmap_ops, ret);
+			for (i = 0; i < gop - vif->tx_unmap_ops; ++i) {
+				if (gop[i].status != GNTST_okay)
+					netdev_err(vif->dev,
+						   " host_addr: %llx handle: %x status: %d\n",
+						   gop[i].host_addr,
+						   gop[i].handle,
+						   gop[i].status);
+			}
+			BUG();
+		}
+	}
+
+	for (i = 0; i < gop - vif->tx_unmap_ops; ++i)
+		xenvif_idx_release(vif, pending_idx_release[i],
+				   XEN_NETIF_RSP_OKAY);
 }
 
+
 /* Called after netfront has transmitted */
 int xenvif_tx_action(struct xenvif *vif, int budget)
 {
 	unsigned nr_gops;
-	int work_done;
+	int work_done, ret;
 
 	if (unlikely(!tx_work_todo(vif)))
 		return 0;
@@ -1403,7 +1491,11 @@ int xenvif_tx_action(struct xenvif *vif, int budget)
 	if (nr_gops == 0)
 		return 0;
 
-	gnttab_batch_copy(vif->tx_copy_ops, nr_gops);
+	ret = gnttab_map_refs(vif->tx_map_ops,
+			      NULL,
+			      vif->pages_to_map,
+			      nr_gops);
+	BUG_ON(ret);
 
 	work_done = xenvif_tx_submit(vif);
 
@@ -1414,45 +1506,19 @@ static void xenvif_idx_release(struct xenvif *vif, u16 pending_idx,
 			       u8 status)
 {
 	struct pending_tx_info *pending_tx_info;
-	pending_ring_idx_t head;
+	pending_ring_idx_t index;
 	u16 peek; /* peek into next tx request */
+	unsigned long flags;
 
-	BUG_ON(vif->mmap_pages[pending_idx] == (void *)(~0UL));
-
-	/* Already complete? */
-	if (vif->mmap_pages[pending_idx] == NULL)
-		return;
-
-	pending_tx_info = &vif->pending_tx_info[pending_idx];
-
-	head = pending_tx_info->head;
-
-	BUG_ON(!pending_tx_is_head(vif, head));
-	BUG_ON(vif->pending_ring[pending_index(head)] != pending_idx);
-
-	do {
-		pending_ring_idx_t index;
-		pending_ring_idx_t idx = pending_index(head);
-		u16 info_idx = vif->pending_ring[idx];
-
-		pending_tx_info = &vif->pending_tx_info[info_idx];
+		pending_tx_info = &vif->pending_tx_info[pending_idx];
+		spin_lock_irqsave(&vif->response_lock, flags);
 		make_tx_response(vif, &pending_tx_info->req, status);
-
-		/* Setting any number other than
-		 * INVALID_PENDING_RING_IDX indicates this slot is
-		 * starting a new packet / ending a previous packet.
-		 */
-		pending_tx_info->head = 0;
-
-		index = pending_index(vif->pending_prod++);
-		vif->pending_ring[index] = vif->pending_ring[info_idx];
-
-		peek = vif->pending_ring[pending_index(++head)];
-
-	} while (!pending_tx_is_head(vif, peek));
-
-	put_page(vif->mmap_pages[pending_idx]);
-	vif->mmap_pages[pending_idx] = NULL;
+		index = pending_index(vif->pending_prod);
+		vif->pending_ring[index] = pending_idx;
+		/* TX shouldn't use the index before we give it back here */
+		mb();
+		vif->pending_prod++;
+		spin_unlock_irqrestore(&vif->response_lock, flags);
 }
 
 
@@ -1500,6 +1566,25 @@ static struct xen_netif_rx_response *make_rx_response(struct xenvif *vif,
 	return resp;
 }
 
+void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx)
+{
+	int ret;
+	struct gnttab_unmap_grant_ref tx_unmap_op;
+
+	gnttab_set_unmap_op(&tx_unmap_op,
+			    idx_to_kaddr(vif, pending_idx),
+			    GNTMAP_host_map,
+			    vif->grant_tx_handle[pending_idx]);
+	/* Btw. already unmapped? */
+	xenvif_grant_handle_reset(vif, pending_idx);
+
+	ret = gnttab_unmap_refs(&tx_unmap_op, NULL,
+				&vif->mmap_pages[pending_idx], 1);
+	BUG_ON(ret);
+
+	xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
+}
+
 static inline int rx_work_todo(struct xenvif *vif)
 {
 	return !skb_queue_empty(&vif->rx_queue) &&
@@ -1516,6 +1601,11 @@ static inline int tx_work_todo(struct xenvif *vif)
 	return 0;
 }
 
+static inline bool tx_dealloc_work_todo(struct xenvif *vif)
+{
+	return vif->dealloc_cons != vif->dealloc_prod;
+}
+
 void xenvif_unmap_frontend_rings(struct xenvif *vif)
 {
 	if (vif->tx.sring)
@@ -1602,6 +1692,28 @@ int xenvif_kthread_guest_rx(void *data)
 	return 0;
 }
 
+int xenvif_dealloc_kthread(void *data)
+{
+	struct xenvif *vif = data;
+
+	while (!kthread_should_stop()) {
+		wait_event_interruptible(vif->dealloc_wq,
+					 tx_dealloc_work_todo(vif) ||
+					 kthread_should_stop());
+		if (kthread_should_stop())
+			break;
+
+		xenvif_tx_dealloc_action(vif);
+		cond_resched();
+	}
+
+	/* Unmap anything remaining*/
+	if (tx_dealloc_work_todo(vif))
+		xenvif_tx_dealloc_action(vif);
+
+	return 0;
+}
+
 static int __init netback_init(void)
 {
 	int rc = 0;


* [PATCH net-next v7 5/9] xen-netback: Remove old TX grant copy definitions and fix indentations
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (3 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 6/9] xen-netback: Add stat counters for zerocopy Zoltan Kiss
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, Zoltan Kiss, jonathan.davies

These became obsolete with grant mapping. I intentionally left the indentation
like this in the earlier patches, to keep their diffs readable; this patch
fixes it up.

NOTE: if bisect brought you here, you should apply the series up until
"xen-netback: Timeout packets in RX path", otherwise Windows guests won't work
properly and malicious guests can block other guests by not releasing their
sent packets.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
v2:
- move the indentation fixup patch here

v4:
- indentation fixes

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 8719500..8e59b14 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -48,37 +48,8 @@
 typedef unsigned int pending_ring_idx_t;
 #define INVALID_PENDING_RING_IDX (~0U)
 
-/* For the head field in pending_tx_info: it is used to indicate
- * whether this tx info is the head of one or more coalesced requests.
- *
- * When head != INVALID_PENDING_RING_IDX, it means the start of a new
- * tx requests queue and the end of previous queue.
- *
- * An example sequence of head fields (I = INVALID_PENDING_RING_IDX):
- *
- * ...|0 I I I|5 I|9 I I I|...
- * -->|<-INUSE----------------
- *
- * After consuming the first slot(s) we have:
- *
- * ...|V V V V|5 I|9 I I I|...
- * -----FREE->|<-INUSE--------
- *
- * where V stands for "valid pending ring index". Any number other
- * than INVALID_PENDING_RING_IDX is OK. These entries are considered
- * free and can contain any number other than
- * INVALID_PENDING_RING_IDX. In practice we use 0.
- *
- * The in use non-INVALID_PENDING_RING_IDX (say 0, 5 and 9 in the
- * above example) number is the index into pending_tx_info and
- * mmap_pages arrays.
- */
 struct pending_tx_info {
-	struct xen_netif_tx_request req; /* coalesced tx request */
-	pending_ring_idx_t head; /* head != INVALID_PENDING_RING_IDX
-				  * if it is head of one or more tx
-				  * reqs
-				  */
+	struct xen_netif_tx_request req; /* tx request */
 	/* Callback data for released SKBs. The callback is always
 	 * xenvif_zerocopy_callback, desc contains the pending_idx, which is
 	 * also an index in pending_tx_info array. It is initialized in
@@ -149,11 +120,6 @@ struct xenvif {
 	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
 	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
 
-	/* Coalescing tx requests before copying makes number of grant
-	 * copy ops greater or equal to number of slots required. In
-	 * worst case a tx request consumes 2 gnttab_copy.
-	 */
-	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
 	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
 	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
 	/* passed to gnttab_[un]map_refs with pages under (un)mapping */
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 763d49a..bd09028 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -62,16 +62,6 @@ module_param(separate_tx_rx_irq, bool, 0644);
 static unsigned int fatal_skb_slots = FATAL_SKB_SLOTS_DEFAULT;
 module_param(fatal_skb_slots, uint, 0444);
 
-/*
- * If head != INVALID_PENDING_RING_IDX, it means this tx request is head of
- * one or more merged tx requests, otherwise it is the continuation of
- * previous tx request.
- */
-static inline int pending_tx_is_head(struct xenvif *vif, RING_IDX idx)
-{
-	return vif->pending_tx_info[idx].head != INVALID_PENDING_RING_IDX;
-}
-
 static void xenvif_idx_release(struct xenvif *vif, u16 pending_idx,
 			       u8 status);
 
@@ -790,19 +780,6 @@ static int xenvif_count_requests(struct xenvif *vif,
 	return slots;
 }
 
-static struct page *xenvif_alloc_page(struct xenvif *vif,
-				      u16 pending_idx)
-{
-	struct page *page;
-
-	page = alloc_page(GFP_ATOMIC|__GFP_COLD);
-	if (!page)
-		return NULL;
-	vif->mmap_pages[pending_idx] = page;
-
-	return page;
-}
-
 
 struct xenvif_tx_cb {
 	u16 pending_idx;
@@ -832,13 +809,9 @@ static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	skb_frag_t *frags = shinfo->frags;
 	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
-	u16 head_idx = 0;
-	int slot, start;
-	struct page *page;
-	pending_ring_idx_t index, start_idx = 0;
-	uint16_t dst_offset;
+	int start;
+	pending_ring_idx_t index;
 	unsigned int nr_slots;
-	struct pending_tx_info *first = NULL;
 
 	/* At this point shinfo->nr_frags is in fact the number of
 	 * slots, which can be as large as XEN_NETBK_LEGACY_SLOTS_MAX.
@@ -850,8 +823,8 @@ static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
 
 	for (shinfo->nr_frags = start; shinfo->nr_frags < nr_slots;
 	     shinfo->nr_frags++, txp++, gop++) {
-				index = pending_index(vif->pending_cons++);
-				pending_idx = vif->pending_ring[index];
+		index = pending_index(vif->pending_cons++);
+		pending_idx = vif->pending_ring[index];
 		xenvif_tx_create_gop(vif, pending_idx, txp, gop);
 		frag_set_pending_idx(&frags[shinfo->nr_frags], pending_idx);
 	}
@@ -859,18 +832,6 @@ static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
 	BUG_ON(shinfo->nr_frags > MAX_SKB_FRAGS);
 
 	return gop;
-err:
-	/* Unwind, freeing all pages and sending error responses. */
-	while (shinfo->nr_frags-- > start) {
-		xenvif_idx_release(vif,
-				frag_get_pending_idx(&frags[shinfo->nr_frags]),
-				XEN_NETIF_RSP_ERROR);
-	}
-	/* The head too, if necessary. */
-	if (start)
-		xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_ERROR);
-
-	return NULL;
 }
 
 static inline void xenvif_grant_handle_set(struct xenvif *vif,
@@ -910,7 +871,6 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 	struct pending_tx_info *tx_info;
 	int nr_frags = shinfo->nr_frags;
 	int i, err, start;
-	u16 peek; /* peek into next tx request */
 
 	/* Check status of header. */
 	err = gop->status;
@@ -924,14 +884,12 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 
 	for (i = start; i < nr_frags; i++) {
 		int j, newerr;
-		pending_ring_idx_t head;
 
 		pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
 		tx_info = &vif->pending_tx_info[pending_idx];
-		head = tx_info->head;
 
 		/* Check error status: if okay then remember grant handle. */
-			newerr = (++gop)->status;
+		newerr = (++gop)->status;
 
 		if (likely(!newerr)) {
 			xenvif_grant_handle_set(vif, pending_idx , gop->handle);
@@ -1136,7 +1094,6 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 	       (skb_queue_len(&vif->tx_queue) < budget)) {
 		struct xen_netif_tx_request txreq;
 		struct xen_netif_tx_request txfrags[XEN_NETBK_LEGACY_SLOTS_MAX];
-		struct page *page;
 		struct xen_netif_extra_info extras[XEN_NETIF_EXTRA_TYPE_MAX-1];
 		u16 pending_idx;
 		RING_IDX idx;
@@ -1507,18 +1464,17 @@ static void xenvif_idx_release(struct xenvif *vif, u16 pending_idx,
 {
 	struct pending_tx_info *pending_tx_info;
 	pending_ring_idx_t index;
-	u16 peek; /* peek into next tx request */
 	unsigned long flags;
 
-		pending_tx_info = &vif->pending_tx_info[pending_idx];
-		spin_lock_irqsave(&vif->response_lock, flags);
-		make_tx_response(vif, &pending_tx_info->req, status);
-		index = pending_index(vif->pending_prod);
-		vif->pending_ring[index] = pending_idx;
-		/* TX shouldn't use the index before we give it back here */
-		mb();
-		vif->pending_prod++;
-		spin_unlock_irqrestore(&vif->response_lock, flags);
+	pending_tx_info = &vif->pending_tx_info[pending_idx];
+	spin_lock_irqsave(&vif->response_lock, flags);
+	make_tx_response(vif, &pending_tx_info->req, status);
+	index = pending_index(vif->pending_prod);
+	vif->pending_ring[index] = pending_idx;
+	/* TX shouldn't use the index before we give it back here */
+	mb();
+	vif->pending_prod++;
+	spin_unlock_irqrestore(&vif->response_lock, flags);
 }

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next v7 6/9] xen-netback: Add stat counters for zerocopy
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (4 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 5/9] xen-netback: Remove old TX grant copy definitons and fix indentations Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 7/9] xen-netback: Handle guests with too many frags Zoltan Kiss
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, Zoltan Kiss, jonathan.davies

These counters help determine how often the buffers had to be copied. They
also help spot leaked packets: if "sent != success + fail", some packets
were probably never freed up properly.
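
As a rough illustration (a hypothetical debug helper, not part of this
patch), the check described above boils down to:

	/* Hypothetical helper, not in this series: a nonzero result means
	 * some zerocopy skbs were probably never freed back to netback.
	 */
	static unsigned long xenvif_zerocopy_outstanding(struct xenvif *vif)
	{
		return vif->tx_zerocopy_sent -
		       (vif->tx_zerocopy_success + vif->tx_zerocopy_fail);
	}

The counters themselves are exposed through ethtool -S, like the existing
rx_gso_checksum_fixup counter.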

NOTE: if bisect brought you here, you should apply the series up until
"xen-netback: Timeout packets in RX path", otherwise Windows guests can't work
properly and malicious guests can block other guests by not releasing their sent
packets.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 79ba456..eac171e 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -167,6 +167,9 @@ struct xenvif {
 
 	/* Statistics */
 	unsigned long rx_gso_checksum_fixup;
+	unsigned long tx_zerocopy_sent;
+	unsigned long tx_zerocopy_success;
+	unsigned long tx_zerocopy_fail;
 
 	/* Miscellaneous private stuff. */
 	struct net_device *dev;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 408ab57..adfed30 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -237,6 +237,21 @@ static const struct xenvif_stat {
 		"rx_gso_checksum_fixup",
 		offsetof(struct xenvif, rx_gso_checksum_fixup)
 	},
+	/* If (sent != success + fail), there are probably packets never
+	 * freed up properly!
+	 */
+	{
+		"tx_zerocopy_sent",
+		offsetof(struct xenvif, tx_zerocopy_sent),
+	},
+	{
+		"tx_zerocopy_success",
+		offsetof(struct xenvif, tx_zerocopy_success),
+	},
+	{
+		"tx_zerocopy_fail",
+		offsetof(struct xenvif, tx_zerocopy_fail)
+	},
 };
 
 static int xenvif_get_sset_count(struct net_device *dev, int string_set)
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index bc91965..44e6988 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1554,8 +1554,10 @@ static int xenvif_tx_submit(struct xenvif *vif)
 		 * do a skb_copy_ubufs while we are still in control of the
 		 * skb. E.g. the __pskb_pull_tail earlier can do such thing.
 		 */
-		if (skb_shinfo(skb)->destructor_arg)
+		if (skb_shinfo(skb)->destructor_arg) {
 			skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+			vif->tx_zerocopy_sent++;
+		}
 
 		netif_receive_skb(skb);
 	}
@@ -1595,6 +1597,11 @@ void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success)
 		napi_schedule(&vif->napi);
 		local_bh_enable();
 	}
+
+	if (likely(zerocopy_success))
+		vif->tx_zerocopy_success++;
+	else
+		vif->tx_zerocopy_fail++;
 }
 
 static inline void xenvif_tx_dealloc_action(struct xenvif *vif)

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next v7 7/9] xen-netback: Handle guests with too many frags
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (5 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 6/9] xen-netback: Add stat counters for zerocopy Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-06 21:48 ` [PATCH net-next v7 8/9] xen-netback: Timeout packets in RX path Zoltan Kiss
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, jonathan.davies, Zoltan Kiss

The Xen network protocol had an implicit dependency on MAX_SKB_FRAGS.
Netback has to handle guests sending up to XEN_NETBK_LEGACY_SLOTS_MAX slots.
To achieve that:
- create a new skb
- map the leftover slots to its frags (no linear buffer here!)
- chain it to the previous one through skb_shinfo(skb)->frag_list
- map them
- copy and coalesce the frags into a brand new skb and send it to the stack
- unmap the two old skbs' pages

It also introduces new stat counters, which help determine how often the
guest sends a packet with more than MAX_SKB_FRAGS frags.

NOTE: if bisect brought you here, you should apply the series up until
"xen-netback: Timeout packets in RX path", otherwise malicious guests can block
other guests by not releasing their sent packets.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
v3:
- adding extra check for frag number
- consolidate alloc_skb's into xenvif_alloc_skb()
- BUG_ON(frag_overflow > MAX_SKB_FRAGS)

v4:
- handle error of skb_copy_expand()

v5:
- ratelimit error messages
- remove a tx_flags setting from xenvif_tx_submit 

v6:
- move the handling out of tx_submit into a new function, as it became quite long
- skb_copy[_expand] allocates a new skb with a huge linear buffer, which is
  bad in times of memory pressure. Just make a new frags array and do the
  copy and coalesce with skb_copy_bits

v7:
- remove obsolete BUG_ON
- remove stale idx_release
- merge the next patch about stats here

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 917dbb0..aaa3d17 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -183,6 +183,7 @@ struct xenvif {
 	unsigned long tx_zerocopy_sent;
 	unsigned long tx_zerocopy_success;
 	unsigned long tx_zerocopy_fail;
+	unsigned long tx_frag_overflow;
 
 	/* Miscellaneous private stuff. */
 	struct net_device *dev;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 44df858..b646039 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -253,6 +253,13 @@ static const struct xenvif_stat {
 		"tx_zerocopy_fail",
 		offsetof(struct xenvif, tx_zerocopy_fail)
 	},
+	/* Number of packets exceeding MAX_SKB_FRAG slots. You should use
+	 * a guest with the same MAX_SKB_FRAG
+	 */
+	{
+		"tx_frag_overflow",
+		offsetof(struct xenvif, tx_frag_overflow)
+	},
 };
 
 static int xenvif_get_sset_count(struct net_device *dev, int string_set)
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index fa4b53f..61f0fb2 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -37,6 +37,7 @@
 #include <linux/kthread.h>
 #include <linux/if_vlan.h>
 #include <linux/udp.h>
+#include <linux/highmem.h>
 
 #include <net/tcp.h>
 
@@ -801,6 +802,23 @@ static inline void xenvif_tx_create_gop(struct xenvif *vif,
 	       sizeof(*txp));
 }
 
+static inline struct sk_buff *xenvif_alloc_skb(unsigned int size)
+{
+	struct sk_buff *skb =
+		alloc_skb(size + NET_SKB_PAD + NET_IP_ALIGN,
+			  GFP_ATOMIC | __GFP_NOWARN);
+	if (unlikely(skb == NULL))
+		return NULL;
+
+	/* Packets passed to netif_rx() must have some headroom. */
+	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
+
+	/* Initialize it here to avoid later surprises */
+	skb_shinfo(skb)->destructor_arg = NULL;
+
+	return skb;
+}
+
 static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
 							struct sk_buff *skb,
 							struct xen_netif_tx_request *txp,
@@ -811,11 +829,16 @@ static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
 	u16 pending_idx = XENVIF_TX_CB(skb)->pending_idx;
 	int start;
 	pending_ring_idx_t index;
-	unsigned int nr_slots;
+	unsigned int nr_slots, frag_overflow = 0;
 
 	/* At this point shinfo->nr_frags is in fact the number of
 	 * slots, which can be as large as XEN_NETBK_LEGACY_SLOTS_MAX.
 	 */
+	if (shinfo->nr_frags > MAX_SKB_FRAGS) {
+		frag_overflow = shinfo->nr_frags - MAX_SKB_FRAGS;
+		BUG_ON(frag_overflow > MAX_SKB_FRAGS);
+		shinfo->nr_frags = MAX_SKB_FRAGS;
+	}
 	nr_slots = shinfo->nr_frags;
 
 	/* Skip first skb fragment if it is on same page as header fragment. */
@@ -829,7 +852,29 @@ static struct gnttab_map_grant_ref *xenvif_get_requests(struct xenvif *vif,
 		frag_set_pending_idx(&frags[shinfo->nr_frags], pending_idx);
 	}
 
-	BUG_ON(shinfo->nr_frags > MAX_SKB_FRAGS);
+	if (frag_overflow) {
+		struct sk_buff *nskb = xenvif_alloc_skb(0);
+		if (unlikely(nskb == NULL)) {
+			if (net_ratelimit())
+				netdev_err(vif->dev,
+					   "Can't allocate the frag_list skb.\n");
+			return NULL;
+		}
+
+		shinfo = skb_shinfo(nskb);
+		frags = shinfo->frags;
+
+		for (shinfo->nr_frags = 0; shinfo->nr_frags < frag_overflow;
+		     shinfo->nr_frags++, txp++, gop++) {
+			index = pending_index(vif->pending_cons++);
+			pending_idx = vif->pending_ring[index];
+			xenvif_tx_create_gop(vif, pending_idx, txp, gop);
+			frag_set_pending_idx(&frags[shinfo->nr_frags],
+					     pending_idx);
+		}
+
+		skb_shinfo(skb)->frag_list = nskb;
+	}
 
 	return gop;
 }
@@ -871,6 +916,7 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 	struct pending_tx_info *tx_info;
 	int nr_frags = shinfo->nr_frags;
 	int i, err, start;
+	struct sk_buff *first_skb = NULL;
 
 	/* Check status of header. */
 	err = gop->status;
@@ -882,6 +928,7 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 	/* Skip first skb fragment if it is on same page as header fragment. */
 	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
 
+check_frags:
 	for (i = start; i < nr_frags; i++) {
 		int j, newerr;
 
@@ -905,9 +952,11 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 		/* Not the first error? Preceding frags already invalidated. */
 		if (err)
 			continue;
-
 		/* First error: invalidate header and preceding fragments. */
-		pending_idx = XENVIF_TX_CB(skb)->pending_idx;
+		if (!first_skb)
+			pending_idx = XENVIF_TX_CB(skb)->pending_idx;
+		else
+			pending_idx = XENVIF_TX_CB(first_skb)->pending_idx;
 		xenvif_idx_unmap(vif, pending_idx);
 		for (j = start; j < i; j++) {
 			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
@@ -918,6 +967,30 @@ static int xenvif_tx_check_gop(struct xenvif *vif,
 		err = newerr;
 	}
 
+	if (skb_has_frag_list(skb)) {
+		first_skb = skb;
+		skb = shinfo->frag_list;
+		shinfo = skb_shinfo(skb);
+		nr_frags = shinfo->nr_frags;
+		start = 0;
+
+		goto check_frags;
+	}
+
+	/* There was a mapping error in the frag_list skb. We have to unmap
+	 * the first skb's frags
+	 */
+	if (first_skb && err) {
+		int j;
+		shinfo = skb_shinfo(first_skb);
+		pending_idx = XENVIF_TX_CB(first_skb)->pending_idx;
+		start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
+		for (j = start; j < shinfo->nr_frags; j++) {
+			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
+			xenvif_idx_unmap(vif, pending_idx);
+		}
+	}
+
 	*gopp = gop + 1;
 	return err;
 }
@@ -1169,8 +1242,7 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 			    ret < XEN_NETBK_LEGACY_SLOTS_MAX) ?
 			PKT_PROT_LEN : txreq.size;
 
-		skb = alloc_skb(data_len + NET_SKB_PAD + NET_IP_ALIGN,
-				GFP_ATOMIC | __GFP_NOWARN);
+		skb = xenvif_alloc_skb(data_len);
 		if (unlikely(skb == NULL)) {
 			netdev_dbg(vif->dev,
 				   "Can't allocate a skb in start_xmit.\n");
@@ -1178,9 +1250,6 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 			break;
 		}
 
-		/* Packets passed to netif_rx() must have some headroom. */
-		skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
-
 		if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
 			struct xen_netif_extra_info *gso;
 			gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];
@@ -1231,6 +1300,71 @@ static unsigned xenvif_tx_build_gops(struct xenvif *vif, int budget)
 	return gop - vif->tx_map_ops;
 }
 
+/* Consolidate skb with a frag_list into a brand new one with local pages on
+ * frags. Returns 0 or -ENOMEM if can't allocate new pages.
+ */
+static int xenvif_handle_frag_list(struct xenvif *vif, struct sk_buff *skb)
+{
+	unsigned int offset = skb_headlen(skb);
+	skb_frag_t frags[MAX_SKB_FRAGS];
+	int i;
+	struct ubuf_info *uarg;
+	struct sk_buff *nskb = skb_shinfo(skb)->frag_list;
+
+	vif->tx_zerocopy_sent += 2;
+	vif->tx_frag_overflow++;
+
+	xenvif_fill_frags(vif, nskb);
+	/* Subtract frags size, we will correct it later */
+	skb->truesize -= skb->data_len;
+	skb->len += nskb->len;
+	skb->data_len += nskb->len;
+
+	/* create a brand new frags array and coalesce there */
+	for (i = 0; offset < skb->len; i++) {
+		struct page *page;
+		unsigned int len;
+
+		BUG_ON(i >= MAX_SKB_FRAGS);
+		page = alloc_page(GFP_ATOMIC|__GFP_COLD);
+		if (!page) {
+			int j;
+			skb->truesize += skb->data_len;
+			for (j = 0; j < i; j++)
+				put_page(frags[j].page.p);
+			return -ENOMEM;
+		}
+
+		if (offset + PAGE_SIZE < skb->len)
+			len = PAGE_SIZE;
+		else
+			len = skb->len - offset;
+		if (skb_copy_bits(skb, offset, page_address(page), len))
+			BUG();
+
+		offset += len;
+		frags[i].page.p = page;
+		frags[i].page_offset = 0;
+		skb_frag_size_set(&frags[i], len);
+	}
+	/* swap out with old one */
+	memcpy(skb_shinfo(skb)->frags,
+	       frags,
+	       i * sizeof(skb_frag_t));
+	skb_shinfo(skb)->nr_frags = i;
+	skb->truesize += i * PAGE_SIZE;
+
+	/* remove traces of mapped pages and frag_list */
+	skb_frag_list_init(skb);
+	uarg = skb_shinfo(skb)->destructor_arg;
+	uarg->callback(uarg, true);
+	skb_shinfo(skb)->destructor_arg = NULL;
+
+	skb_shinfo(nskb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	kfree_skb(nskb);
+
+	return 0;
+}
 
 static int xenvif_tx_submit(struct xenvif *vif)
 {
@@ -1267,7 +1401,6 @@ static int xenvif_tx_submit(struct xenvif *vif)
 				&vif->pending_tx_info[pending_idx].callback_struct;
 		} else {
 			/* Schedule a response immediately. */
-			skb_shinfo(skb)->destructor_arg = NULL;
 			xenvif_idx_unmap(vif, pending_idx);
 		}
 
@@ -1278,6 +1411,17 @@ static int xenvif_tx_submit(struct xenvif *vif)
 
 		xenvif_fill_frags(vif, skb);
 
+		if (unlikely(skb_has_frag_list(skb))) {
+			if (xenvif_handle_frag_list(vif, skb)) {
+				if (net_ratelimit())
+					netdev_err(vif->dev,
+						   "Not enough memory to consolidate frag_list!\n");
+				skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+				kfree_skb(skb);
+				continue;
+			}
+		}
+
 		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {
 			int target = min_t(int, skb->len, PKT_PROT_LEN);
 			__pskb_pull_tail(skb, target - skb_headlen(skb));

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next v7 8/9] xen-netback: Timeout packets in RX path
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (6 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 7/9] xen-netback: Handle guests with too many frags Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-13 10:39   ` Ian Campbell
  2014-03-06 21:48 ` [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations Zoltan Kiss
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, Zoltan Kiss, jonathan.davies

A malicious or buggy guest can leave its queue filled indefinitely, in which
case the qdisc starts to queue packets for that VIF. If those packets came
from another guest, they can block the other guest's slots and prevent its
shutdown. To avoid that, we make sure the queue is drained every 10 seconds.
In the worst case the QDisc queue usually takes 3 rounds to flush.
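
For illustration, with the default rx_drain_timeout_msecs of 10000 and the
usual constants at the time of writing (XENVIF_QUEUE_LENGTH = 32,
XEN_NETIF_RX_RING_SIZE = 256, MAX_SKB_FRAGS = 17 with 4k pages -- adjust if
your tree differs), the worst case skb lifetime used later in xenvif_free()
works out as:

	worst_case_skb_lifetime
		= (rx_drain_timeout_msecs / 1000) *
		  DIV_ROUND_UP(XENVIF_QUEUE_LENGTH,
			       XEN_NETIF_RX_RING_SIZE / MAX_SKB_FRAGS)
		= 10 * DIV_ROUND_UP(32, 15)
		= 10 * 3 = 30 seconds

which is where the "3 rounds" figure above comes from.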

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
v3:
- remove stale debug log
- tie unmap timeout in xenvif_free to this timeout

v4:
- due to RX flow control changes start_xmit now doesn't drop the packets but
  places them on the internal queue. So the timer sets rx_queue_purge and
  kicks the thread to drop the packets there
- we shoot down the timer if a previously filled internal queue drains
- adjust the teardown timeout as in the worst case it can take more time now

v5:
- create a separate variable worst_case_skb_lifetime and add an explanation
  about why it is so long

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 109c29f..d1cd8ce 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -129,6 +129,9 @@ struct xenvif {
 	struct xen_netif_rx_back_ring rx;
 	struct sk_buff_head rx_queue;
 	RING_IDX rx_last_skb_slots;
+	bool rx_queue_purge;
+
+	struct timer_list wake_queue;
 
 	/* This array is allocated seperately as it is large */
 	struct gnttab_copy *grant_copy_op;
@@ -225,4 +228,7 @@ void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx);
 
 extern bool separate_tx_rx_irq;
 
+extern unsigned int rx_drain_timeout_msecs;
+extern unsigned int rx_drain_timeout_jiffies;
+
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index af6b3e1..40aa500 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -114,6 +114,18 @@ static irqreturn_t xenvif_interrupt(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+static void xenvif_wake_queue(unsigned long data)
+{
+	struct xenvif *vif = (struct xenvif *)data;
+
+	if (netif_queue_stopped(vif->dev)) {
+		netdev_err(vif->dev, "draining TX queue\n");
+		vif->rx_queue_purge = true;
+		xenvif_kick_thread(vif);
+		netif_wake_queue(vif->dev);
+	}
+}
+
 static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct xenvif *vif = netdev_priv(dev);
@@ -143,8 +155,13 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * then turn off the queue to give the ring a chance to
 	 * drain.
 	 */
-	if (!xenvif_rx_ring_slots_available(vif, min_slots_needed))
+	if (!xenvif_rx_ring_slots_available(vif, min_slots_needed)) {
+		vif->wake_queue.function = xenvif_wake_queue;
+		vif->wake_queue.data = (unsigned long)vif;
 		xenvif_stop_queue(vif);
+		mod_timer(&vif->wake_queue,
+			jiffies + rx_drain_timeout_jiffies);
+	}
 
 	skb_queue_tail(&vif->rx_queue, skb);
 	xenvif_kick_thread(vif);
@@ -352,6 +369,8 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	init_timer(&vif->credit_timeout);
 	vif->credit_window_start = get_jiffies_64();
 
+	init_timer(&vif->wake_queue);
+
 	dev->netdev_ops	= &xenvif_netdev_ops;
 	dev->hw_features = NETIF_F_SG |
 		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
@@ -532,6 +551,7 @@ void xenvif_disconnect(struct xenvif *vif)
 		xenvif_carrier_off(vif);
 
 	if (vif->task) {
+		del_timer_sync(&vif->wake_queue);
 		kthread_stop(vif->task);
 		vif->task = NULL;
 	}
@@ -557,12 +577,25 @@ void xenvif_disconnect(struct xenvif *vif)
 void xenvif_free(struct xenvif *vif)
 {
 	int i, unmap_timeout = 0;
+	/* Here we want to avoid timeout messages if an skb can be legitimatly
+	 * stucked somewhere else. Realisticly this could be an another vif's
+	 * internal or QDisc queue. That another vif also has this
+	 * rx_drain_timeout_msecs timeout, but the timer only ditches the
+	 * internal queue. After that, the QDisc queue can put in worst case
+	 * XEN_NETIF_RX_RING_SIZE / MAX_SKB_FRAGS skbs into that another vif's
+	 * internal queue, so we need several rounds of such timeouts until we
+	 * can be sure that no another vif should have skb's from us. We are
+	 * not sending more skb's, so newly stucked packets are not interesting
+	 * for us here.
+	 */
+	unsigned int worst_case_skb_lifetime = (rx_drain_timeout_msecs/1000) *
+		DIV_ROUND_UP(XENVIF_QUEUE_LENGTH, (XEN_NETIF_RX_RING_SIZE / MAX_SKB_FRAGS));
 
 	for (i = 0; i < MAX_PENDING_REQS; ++i) {
 		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
 			unmap_timeout++;
 			schedule_timeout(msecs_to_jiffies(1000));
-			if (unmap_timeout > 9 &&
+			if (unmap_timeout > worst_case_skb_lifetime &&
 			    net_ratelimit())
 				netdev_err(vif->dev,
 					   "Page still granted! Index: %x\n",
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 560950e..bb65c7c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -63,6 +63,13 @@ module_param(separate_tx_rx_irq, bool, 0644);
 static unsigned int fatal_skb_slots = FATAL_SKB_SLOTS_DEFAULT;
 module_param(fatal_skb_slots, uint, 0444);
 
+/* When guest ring is filled up, qdisc queues the packets for us, but we have
+ * to timeout them, otherwise other guests' packets can get stucked there
+ */
+unsigned int rx_drain_timeout_msecs = 10000;
+module_param(rx_drain_timeout_msecs, uint, 0444);
+unsigned int rx_drain_timeout_jiffies;
+
 /*
  * To avoid confusion, we define XEN_NETBK_LEGACY_SLOTS_MAX indicating
  * the maximum slots a valid packet can use. Now this value is defined
@@ -1909,8 +1916,9 @@ static struct xen_netif_rx_response *make_rx_response(struct xenvif *vif,
 
 static inline int rx_work_todo(struct xenvif *vif)
 {
-	return !skb_queue_empty(&vif->rx_queue) &&
-	       xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots);
+	return (!skb_queue_empty(&vif->rx_queue) &&
+	       xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots)) ||
+	       vif->rx_queue_purge;
 }
 
 static inline int tx_work_todo(struct xenvif *vif)
@@ -1998,12 +2006,19 @@ int xenvif_kthread(void *data)
 		if (kthread_should_stop())
 			break;
 
+		if (vif->rx_queue_purge) {
+			skb_queue_purge(&vif->rx_queue);
+			vif->rx_queue_purge = false;
+		}
+
 		if (!skb_queue_empty(&vif->rx_queue))
 			xenvif_rx_action(vif);
 
 		if (skb_queue_empty(&vif->rx_queue) &&
-		    netif_queue_stopped(vif->dev))
+		    netif_queue_stopped(vif->dev)) {
+			del_timer_sync(&vif->wake_queue);
 			xenvif_start_queue(vif);
+		}
 
 		cond_resched();
 	}
@@ -2054,6 +2069,8 @@ static int __init netback_init(void)
 	if (rc)
 		goto failed_init;
 
+	rx_drain_timeout_jiffies = msecs_to_jiffies(rx_drain_timeout_msecs);
+
 	return 0;
 
 failed_init:

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (7 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 8/9] xen-netback: Timeout packets in RX path Zoltan Kiss
@ 2014-03-06 21:48 ` Zoltan Kiss
  2014-03-19 21:16   ` Zoltan Kiss
  2014-03-07 21:05 ` [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy David Miller
  2014-03-13 10:08 ` Ian Campbell
  10 siblings, 1 reply; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-06 21:48 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, jonathan.davies, Zoltan Kiss

Unmapping causes TLB flushing, therefore we should do it in the largest
possible batches. However we shouldn't starve the guest for too long. So if
the guest has space for at least two big packets and we don't have at least a
quarter of the ring to unmap, delay it for at most 1 millisecond.
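
In concrete numbers (assuming the usual XEN_NETBK_LEGACY_SLOTS_MAX = 18 and
MAX_PENDING_REQS = 256 -- adjust if your tree differs), the dealloc thread
only defers its work while

	nr_free_slots(&vif->tx) > 2 * XEN_NETBK_LEGACY_SLOTS_MAX	/* > 36 */
	&& vif->dealloc_prod - vif->dealloc_cons < MAX_PENDING_REQS / 4	/* < 64 */

holds, and even then for at most 1 millisecond before the timer fires.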

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
v4:
- use bool for tx_dealloc_work_todo

v6:
- rebase tx_dealloc_work_todo due to missing ;

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index d1cd8ce..95498c8 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -118,6 +118,8 @@ struct xenvif {
 	u16 dealloc_ring[MAX_PENDING_REQS];
 	struct task_struct *dealloc_task;
 	wait_queue_head_t dealloc_wq;
+	struct timer_list dealloc_delay;
+	bool dealloc_delay_timed_out;
 
 	/* Use kthread for guest RX */
 	struct task_struct *task;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 40aa500..f925af5 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -407,6 +407,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 			  .desc = i };
 		vif->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
 	}
+	init_timer(&vif->dealloc_delay);
 
 	/*
 	 * Initialise a dummy MAC address. We choose the numerically
@@ -557,6 +558,7 @@ void xenvif_disconnect(struct xenvif *vif)
 	}
 
 	if (vif->dealloc_task) {
+		del_timer_sync(&vif->dealloc_delay);
 		kthread_stop(vif->dealloc_task);
 		vif->dealloc_task = NULL;
 	}
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index bb65c7c..c098276 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -135,6 +135,11 @@ static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
 		vif->pending_prod + vif->pending_cons;
 }
 
+static inline pending_ring_idx_t nr_free_slots(struct xen_netif_tx_back_ring *ring)
+{
+	return ring->nr_ents -	(ring->sring->req_prod - ring->rsp_prod_pvt);
+}
+
 bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
 {
 	RING_IDX prod, cons;
@@ -1932,9 +1937,36 @@ static inline int tx_work_todo(struct xenvif *vif)
 	return 0;
 }
 
+static void xenvif_dealloc_delay(unsigned long data)
+{
+	struct xenvif *vif = (struct xenvif *)data;
+
+	vif->dealloc_delay_timed_out = true;
+	wake_up(&vif->dealloc_wq);
+}
+
 static inline bool tx_dealloc_work_todo(struct xenvif *vif)
 {
-	return vif->dealloc_cons != vif->dealloc_prod;
+	if (vif->dealloc_cons != vif->dealloc_prod) {
+		if ((nr_free_slots(&vif->tx) > 2 * XEN_NETBK_LEGACY_SLOTS_MAX) &&
+		    (vif->dealloc_prod - vif->dealloc_cons < MAX_PENDING_REQS / 4) &&
+		    !vif->dealloc_delay_timed_out) {
+			if (!timer_pending(&vif->dealloc_delay)) {
+				vif->dealloc_delay.function =
+					xenvif_dealloc_delay;
+				vif->dealloc_delay.data = (unsigned long)vif;
+				mod_timer(&vif->dealloc_delay,
+					  jiffies + msecs_to_jiffies(1));
+
+			}
+			return false;
+		}
+		del_timer_sync(&vif->dealloc_delay);
+		vif->dealloc_delay_timed_out = false;
+		return true;
+	}
+
+	return false;
 }
 
 void xenvif_unmap_frontend_rings(struct xenvif *vif)

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (8 preceding siblings ...)
  2014-03-06 21:48 ` [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations Zoltan Kiss
@ 2014-03-07 21:05 ` David Miller
  2014-03-08 14:37   ` Zoltan Kiss
  2014-03-13 10:08 ` Ian Campbell
  10 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2014-03-07 21:05 UTC (permalink / raw)
  To: zoltan.kiss
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

From: Zoltan Kiss <zoltan.kiss@citrix.com>
Date: Thu, 6 Mar 2014 21:48:22 +0000

> A long known problem of the upstream netback implementation that on the TX
> path (from guest to Dom0) it copies the whole packet from guest memory into
> Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
> huge perfomance penalty. The classic kernel version of netback used grant
> mapping, and to get notified when the page can be unmapped, it used page
> destructors. Unfortunately that destructor is not an upstreamable solution.
> Ian Campbell's skb fragment destructor patch series [1] tried to solve this
> problem, however it seems to be very invasive on the network stack's code,
> and therefore haven't progressed very well.
> This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
> know when the skb is freed up. That is the way KVM solved the same problem,
> and based on my initial tests it can do the same for us. Avoiding the extra
> copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
> Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
> running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
> switch)
> Based on my investigations the packet get only copied if it is delivered to
> Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
> DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
> unfortunate, but luckily it doesn't cause a major regression for this usecase.
> In the future we should try to eliminate that copy somehow.
> There are a few spinoff tasks which will be addressed in separate patches:
> - grant copy the header directly instead of map and memcpy. This should help
>   us avoiding TLB flushing
> - use something else than ballooned pages
> - fix grant map to use page->index properly
> I've tried to broke it down to smaller patches, with mixed results, so I
> welcome suggestions on that part as well:
> 1: Use skb->cb to store pending_idx
> 2: Some refactoring
> 3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
> review it after #4)
> 4: Introduce TX grant mapping
> 5: Remove old TX grant copy definitons and fix indentations
> 6: Add stat counters for zerocopy
> 7: Handle guests with too many frags
> 8: Timeout packets in RX path
> 9: Aggregate TX unmap operations

Series applied, thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-07 21:05 ` [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy David Miller
@ 2014-03-08 14:37   ` Zoltan Kiss
  2014-03-08 23:57     ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-08 14:37 UTC (permalink / raw)
  To: David Miller
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies, andrew.bennieston

On 07/03/14 21:05, David Miller wrote:
> From: Zoltan Kiss <zoltan.kiss@citrix.com>
> Date: Thu, 6 Mar 2014 21:48:22 +0000
>
>> A long known problem of the upstream netback implementation that on the TX
>> path (from guest to Dom0) it copies the whole packet from guest memory into
>> Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
>> huge perfomance penalty. The classic kernel version of netback used grant
>> mapping, and to get notified when the page can be unmapped, it used page
>> destructors. Unfortunately that destructor is not an upstreamable solution.
>> Ian Campbell's skb fragment destructor patch series [1] tried to solve this
>> problem, however it seems to be very invasive on the network stack's code,
>> and therefore haven't progressed very well.
>> This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
>> know when the skb is freed up. That is the way KVM solved the same problem,
>> and based on my initial tests it can do the same for us. Avoiding the extra
>> copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
>> Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
>> running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
>> switch)
>> Based on my investigations the packet get only copied if it is delivered to
>> Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
>> DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
>> unfortunate, but luckily it doesn't cause a major regression for this usecase.
>> In the future we should try to eliminate that copy somehow.
>> There are a few spinoff tasks which will be addressed in separate patches:
>> - grant copy the header directly instead of map and memcpy. This should help
>>    us avoiding TLB flushing
>> - use something else than ballooned pages
>> - fix grant map to use page->index properly
>> I've tried to broke it down to smaller patches, with mixed results, so I
>> welcome suggestions on that part as well:
>> 1: Use skb->cb to store pending_idx
>> 2: Some refactoring
>> 3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
>> review it after #4)
>> 4: Introduce TX grant mapping
>> 5: Remove old TX grant copy definitons and fix indentations
>> 6: Add stat counters for zerocopy
>> 7: Handle guests with too many frags
>> 8: Timeout packets in RX path
>> 9: Aggregate TX unmap operations
> Series applied, thanks.
Well, thanks, I'm happy that things are moving fast :), but I'm not sure
it's good to apply a series before the maintainers ack it. As far as I've
seen neither Wei nor Ian has said the final word, and I guess Ian hasn't
had time to finish his review yet. There is another series from Andrew
Bennieston which was half-acked by Wei:

"This series looks good enough for me. IIRC Ian said it's still in his 
queue so I will wait for his final review."

Maybe you mixed up mine with that? But that's also not eligible to be 
applied yet.

Zoli

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-08 14:37   ` Zoltan Kiss
@ 2014-03-08 23:57     ` David Miller
  2014-03-10 10:15       ` Wei Liu
  2014-03-12 15:40       ` Ian Campbell
  0 siblings, 2 replies; 36+ messages in thread
From: David Miller @ 2014-03-08 23:57 UTC (permalink / raw)
  To: zoltan.kiss
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies, andrew.bennieston

From: Zoltan Kiss <zoltan.kiss@citrix.com>
Date: Sat, 8 Mar 2014 14:37:50 +0000

> Maybe you mixed up mine with that? But that's also not eligible to be
> applied yet.

I can always revert the series if there are major objections.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-08 23:57     ` David Miller
@ 2014-03-10 10:15       ` Wei Liu
  2014-03-12 15:40       ` Ian Campbell
  1 sibling, 0 replies; 36+ messages in thread
From: Wei Liu @ 2014-03-10 10:15 UTC (permalink / raw)
  To: David Miller
  Cc: zoltan.kiss, ian.campbell, wei.liu2, xen-devel, netdev,
	linux-kernel, jonathan.davies, andrew.bennieston

On Sat, Mar 08, 2014 at 06:57:59PM -0500, David Miller wrote:
> From: Zoltan Kiss <zoltan.kiss@citrix.com>
> Date: Sat, 8 Mar 2014 14:37:50 +0000
> 
> > Maybe you mixed up mine with that? But that's also not eligible to be
> > applied yet.
> 
> I can always revert the series if there are major objections.

David, I appreciate your speed, but can you revert this series first. I
think this series still needs to undergo review. I believe Ian still
wants to have a look at it but he's travelling at the moment.

Wei.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-08 23:57     ` David Miller
  2014-03-10 10:15       ` Wei Liu
@ 2014-03-12 15:40       ` Ian Campbell
  2014-03-12 18:49         ` Zoltan Kiss
  2014-03-13 10:43         ` Ian Campbell
  1 sibling, 2 replies; 36+ messages in thread
From: Ian Campbell @ 2014-03-12 15:40 UTC (permalink / raw)
  To: David Miller
  Cc: zoltan.kiss, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies, andrew.bennieston

On Sat, 2014-03-08 at 18:57 -0500, David Miller wrote:
> From: Zoltan Kiss <zoltan.kiss@citrix.com>
> Date: Sat, 8 Mar 2014 14:37:50 +0000
> 
> > Maybe you mixed up mine with that? But that's also not eligible to be
> > applied yet.
> 
> I can always revert the series if there are major objections.

Zoltan -- does this patch series suffer from/expose the confusion
regarding RING_HAS_UNCONSUMED_REQUESTS which we are discussing
separately on xen-devel? If the answer is yes then I think this series
should be reverted for the time being because there seems to be some
fairly fundamental questions about the semantics of that macro.

If the answer is no then I will endeavour to review this version of the
series ASAP (hopefully tomorrow) and determine if I have any other major
objections which would warrant a revert.

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-12 15:40       ` Ian Campbell
@ 2014-03-12 18:49         ` Zoltan Kiss
  2014-03-13 10:43         ` Ian Campbell
  1 sibling, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-12 18:49 UTC (permalink / raw)
  To: Ian Campbell, David Miller
  Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies,
	andrew.bennieston

On 12/03/14 15:40, Ian Campbell wrote:
> On Sat, 2014-03-08 at 18:57 -0500, David Miller wrote:
>> From: Zoltan Kiss <zoltan.kiss@citrix.com>
>> Date: Sat, 8 Mar 2014 14:37:50 +0000
>>
>>> Maybe you mixed up mine with that? But that's also not eligible to be
>>> applied yet.
>>
>> I can always revert the series if there are major objections.
>
> Zoltan -- does this patch series suffer from/expose the confusion
> regarding RING_HAS_UNCONSUMED_REQUESTS which we are discussing
> separately on xen-devel? If the answer is yes then I think this series
> should be reverted for the time being because there seems to be some
> fairly fundamental questions about the semantics of that macro.
I haven't seen it causing any issues during my testing, and it went
through several XenRT nightlies. That topic
("RING_HAS_UNCONSUMED_REQUESTS oddness" on xen-devel) came up on
theoretical grounds. One outcome of it is that we should move that
napi_schedule from the callback to the end of the dealloc thread to be
on the safe side. I can post a short patch for that.

>
> If the answer is no then I will endeavour to review this version of the
> series ASAP (hopefully tomorrow) and determine if I have any other major
> objections which would warrant a revert.
>
> Ian.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
                   ` (9 preceding siblings ...)
  2014-03-07 21:05 ` [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy David Miller
@ 2014-03-13 10:08 ` Ian Campbell
  2014-03-13 18:23   ` Zoltan Kiss
  10 siblings, 1 reply; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 10:08 UTC (permalink / raw)
  To: Zoltan Kiss; +Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> A long known problem of the upstream netback implementation that on the TX
> path (from guest to Dom0) it copies the whole packet from guest memory into
> Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
> huge perfomance penalty. The classic kernel version of netback used grant
> mapping, and to get notified when the page can be unmapped, it used page
> destructors. Unfortunately that destructor is not an upstreamable solution.
> Ian Campbell's skb fragment destructor patch series [1] tried to solve this
> problem, however it seems to be very invasive on the network stack's code,
> and therefore haven't progressed very well.
> This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
> know when the skb is freed up. That is the way KVM solved the same problem,
> and based on my initial tests it can do the same for us. Avoiding the extra
> copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
> Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
> running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
> switch)

Do you have any other numbers? e.g. for a modern Intel or AMD system? A
slower box is likely to make the difference between copy and map larger,
whereas modern Intel for example is supposed to be very good at copying.

> Based on my investigations the packet get only copied if it is delivered to
> Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
> DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
> unfortunate, but luckily it doesn't cause a major regression for this usecase.

Numbers?

> In the future we should try to eliminate that copy somehow.
> There are a few spinoff tasks which will be addressed in separate patches:
> - grant copy the header directly instead of map and memcpy. This should help
>   us avoiding TLB flushing
> - use something else than ballooned pages
> - fix grant map to use page->index properly
> I've tried to broke it down to smaller patches, with mixed results, so I
> welcome suggestions on that part as well:
> 1: Use skb->cb to store pending_idx
> 2: Some refactoring
> 3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
> review it after #4)
> 4: Introduce TX grant mapping
> 5: Remove old TX grant copy definitons and fix indentations
> 6: Add stat counters for zerocopy
> 7: Handle guests with too many frags
> 8: Timeout packets in RX path
> 9: Aggregate TX unmap operations
> 
> v2: I've fixed some smaller things, see the individual patches. I've added a
> few new stat counters, and handling the important use case when an older guest
> sends lots of slots. Instead of delayed copy now we timeout packets on the RX
> path, based on the assumption that otherwise packets should get stucked
> anywhere else. Finally some unmap batching to avoid too much TLB flush
> 
> v3: Apart from fixing a few things mentioned in responses the important change
> is the use the hypercall directly for grant [un]mapping, therefore we can
> avoid m2p override.
> 
> v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue
> timeout logic changed also.
> 
> v5: Only minor fixes based on Wei's comments
> 
> v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see
> first 2 patches. Also rework of handling packets with too many slots, and
> reorder the series a bit.
> 
> v7: Small fixes in comments/log messages/error paths, and merging the frag
> overflow stats patch into its parent.
> 
> [1] http://lwn.net/Articles/491522/
> [2] https://lkml.org/lkml/2012/7/20/363
> 
> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-06 21:48 ` [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping Zoltan Kiss
@ 2014-03-13 10:17   ` Ian Campbell
  2014-03-13 12:34     ` Zoltan Kiss
  2014-03-13 10:33   ` Ian Campbell
  1 sibling, 1 reply; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 10:17 UTC (permalink / raw)
  To: Zoltan Kiss, Konrad Rzeszutek Wilk, David Vrabel
  Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

Pulling out this one comment for the attention of the core Xen/Linux
maintainers.

On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
[...]
> @@ -343,8 +347,26 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>  	vif->pending_prod = MAX_PENDING_REQS;
>  	for (i = 0; i < MAX_PENDING_REQS; i++)
>  		vif->pending_ring[i] = i;
> -	for (i = 0; i < MAX_PENDING_REQS; i++)
> -		vif->mmap_pages[i] = NULL;
> +	spin_lock_init(&vif->callback_lock);
> +	spin_lock_init(&vif->response_lock);
> +	/* If ballooning is disabled, this will consume real memory, so you
> +	 * better enable it. The long term solution would be to use just a
> +	 * bunch of valid page descriptors, without dependency on ballooning
> +	 */

I wonder if we ought to enforce this via Kconfig? i.e. making
CONFIG_XEN_BACKEND (or the individual backends) depend on BALLOON (or
select?) or by making CONFIG_XEN_BALLOON non-optional etc.

IIRC David V was looking into a solution involving auto hotplugging a
new region to use for this case, but then I guess
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG would equally need to be enabled.

> +	err = alloc_xenballooned_pages(MAX_PENDING_REQS,
> +				       vif->mmap_pages,
> +				       false);
[...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-06 21:48 ` [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping Zoltan Kiss
  2014-03-13 10:17   ` Ian Campbell
@ 2014-03-13 10:33   ` Ian Campbell
  2014-03-13 10:56     ` [Xen-devel] " David Vrabel
  2014-03-13 13:17     ` Zoltan Kiss
  1 sibling, 2 replies; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 10:33 UTC (permalink / raw)
  To: Zoltan Kiss; +Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> @@ -135,13 +146,31 @@ struct xenvif {
>  	pending_ring_idx_t pending_cons;
>  	u16 pending_ring[MAX_PENDING_REQS];
>  	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
>  
>  	/* Coalescing tx requests before copying makes number of grant
>  	 * copy ops greater or equal to number of slots required. In
>  	 * worst case a tx request consumes 2 gnttab_copy.
>  	 */
>  	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
> -
> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];

I wonder if we should break some of these arrays into separate
allocations? Wasn't there a problem with sizeof(struct xenvif) at one
point?

> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
> index bc32627..1fe9fe5 100644
> --- a/drivers/net/xen-netback/interface.c
> +++ b/drivers/net/xen-netback/interface.c
> @@ -493,6 +533,23 @@ void xenvif_disconnect(struct xenvif *vif)
>  
>  void xenvif_free(struct xenvif *vif)
>  {
> +	int i, unmap_timeout = 0;
> +
> +	for (i = 0; i < MAX_PENDING_REQS; ++i) {
> +		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
> +			unmap_timeout++;
> +			schedule_timeout(msecs_to_jiffies(1000));
> +			if (unmap_timeout > 9 &&
> +			    net_ratelimit())

Does this really reach 80 columns when unwrapped?

(there seems to my eye to be a lot of overaggressive wrapping in this
patch, but nevermind)

> +				netdev_err(vif->dev,
> +					   "Page still granted! Index: %x\n",
> +					   i);
> +			i = -1;

Should there not be a break here? Otherwise don't we restart the for
loop from 0 again? If that is intentional then a comment would be very
useful.

> @@ -919,11 +873,38 @@ err:
>  	return NULL;
>  }
>  
> +static inline void xenvif_grant_handle_set(struct xenvif *vif,
> +					   u16 pending_idx,
> +					   grant_handle_t handle)
> +{
> +	if (unlikely(vif->grant_tx_handle[pending_idx] !=
> +		     NETBACK_INVALID_HANDLE)) {
> +		netdev_err(vif->dev,

Is this in any way guest triggerable? Needs to be ratelimited in that
case (and arguably even if not?)

> +			   "Trying to overwrite active handle! pending_idx: %x\n",
> +			   pending_idx);
> +		BUG();
> +	}
> +	vif->grant_tx_handle[pending_idx] = handle;
> +}
> +
> +static inline void xenvif_grant_handle_reset(struct xenvif *vif,
> +					     u16 pending_idx)
> +{
> +	if (unlikely(vif->grant_tx_handle[pending_idx] ==
> +		     NETBACK_INVALID_HANDLE)) {
> +		netdev_err(vif->dev,

Likewise.

> +			   "Trying to unmap invalid handle! pending_idx: %x\n",
> +			   pending_idx);
> +		BUG();
> +	}
> +	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
> +}
> +
> @@ -1001,6 +982,17 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
>  
>  		pending_idx = frag_get_pending_idx(frag);
>  
> +		/* If this is not the first frag, chain it to the previous*/
> +		if (unlikely(prev_pending_idx == INVALID_PENDING_IDX))
> +			skb_shinfo(skb)->destructor_arg =
> +				&vif->pending_tx_info[pending_idx].callback_struct;
> +		else if (likely(pending_idx != prev_pending_idx))
> +			vif->pending_tx_info[prev_pending_idx].callback_struct.ctx =
> +				&(vif->pending_tx_info[pending_idx].callback_struct);

#define callback_for(vif, pending_idx) .... would make this and a bunch
of other places a lot less verbose IMHO.
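
Something along these lines (untested sketch):

	#define callback_for(vif, pending_idx) \
		(&(vif)->pending_tx_info[(pending_idx)].callback_struct)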

> +		index = pending_index(vif->pending_prod);
> +		vif->pending_ring[index] = pending_idx;
> +		/* TX shouldn't use the index before we give it back here */

I hope this comment refers to the pending_prod++ and not the mb(), since
the barrier only guarantees visibility after that point, but not
invisibility before this point.
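
i.e. (as I read it) the intended ordering is:

	vif->pending_ring[index] = pending_idx;	/* give the index back */
	mb();		/* make that store visible before the producer
			 * increment below becomes visible */
	vif->pending_prod++;	/* only from here on may TX reuse the index */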

[...]
> +	/* Btw. already unmapped? */

What does this comment mean? Is it a fixme? An indicator that
xenvif_grant_handle_reset is supposed to handle this case or something
else?

I think there was another such comment earlier too.

> +	xenvif_grant_handle_reset(vif, pending_idx);
> +
> +	ret = gnttab_unmap_refs(&tx_unmap_op, NULL,
> +				&vif->mmap_pages[pending_idx], 1);
> +	BUG_ON(ret);
> +
> +	xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
> +}
> +
>  static inline int rx_work_todo(struct xenvif *vif)
>  {
>  	return !skb_queue_empty(&vif->rx_queue) &&

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 8/9] xen-netback: Timeout packets in RX path
  2014-03-06 21:48 ` [PATCH net-next v7 8/9] xen-netback: Timeout packets in RX path Zoltan Kiss
@ 2014-03-13 10:39   ` Ian Campbell
  0 siblings, 0 replies; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 10:39 UTC (permalink / raw)
  To: Zoltan Kiss; +Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> @@ -557,12 +577,25 @@ void xenvif_disconnect(struct xenvif *vif)
>  void xenvif_free(struct xenvif *vif)
>  {
>  	int i, unmap_timeout = 0;
> +	/* Here we want to avoid timeout messages if an skb can be legitimatly

"legitimately"

> +	 * stucked somewhere else. Realisticly this could be an another vif's

"stuck" and "Realistically"

> +	 * internal or QDisc queue. That another vif also has this
> +	 * rx_drain_timeout_msecs timeout, but the timer only ditches the
> +	 * internal queue. After that, the QDisc queue can put in worst case
> +	 * XEN_NETIF_RX_RING_SIZE / MAX_SKB_FRAGS skbs into that another vif's
> +	 * internal queue, so we need several rounds of such timeouts until we
> +	 * can be sure that no another vif should have skb's from us. We are
> +	 * not sending more skb's, so newly stucked packets are not interesting

"stuck" again.

> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 560950e..bb65c7c 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -63,6 +63,13 @@ module_param(separate_tx_rx_irq, bool, 0644);
>  static unsigned int fatal_skb_slots = FATAL_SKB_SLOTS_DEFAULT;
>  module_param(fatal_skb_slots, uint, 0444);
>  
> +/* When guest ring is filled up, qdisc queues the packets for us, but we have
> + * to timeout them, otherwise other guests' packets can get stucked there

"stuck"

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-12 15:40       ` Ian Campbell
  2014-03-12 18:49         ` Zoltan Kiss
@ 2014-03-13 10:43         ` Ian Campbell
  1 sibling, 0 replies; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 10:43 UTC (permalink / raw)
  To: David Miller
  Cc: zoltan.kiss, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies, andrew.bennieston

On Wed, 2014-03-12 at 15:40 +0000, Ian Campbell wrote:
> On Sat, 2014-03-08 at 18:57 -0500, David Miller wrote:
> > From: Zoltan Kiss <zoltan.kiss@citrix.com>
> > Date: Sat, 8 Mar 2014 14:37:50 +0000
> > 
> > > Maybe you mixed up mine with that? But that's also not eligible to be
> > > applied yet.
> > 
> > I can always revert the series if there are major objections.
> 
> Zoltan -- does this patch series suffer from/expose the confusion
> regarding RING_HAS_UNCONSUMED_REQUESTS which we are discussing
> separately on xen-devel? If the answer is yes then I think this series
> should be reverted for the time being because there seems to be some
> fairly fundamental questions about the semantics of that macro.

The answer was yes but we think this is fixed by "xen-netback: Schedule
NAPI from dealloc thread instead of callback" sent last night. I'll
review that next.

> If the answer is no then I will endeavour to review this version of the
> series ASAP (hopefully tomorrow) and determine if I have any other major
> objections which would warrant a revert.

There were a few things which I would have preferred to see fixed (or
understood) but what's done is done and I don't think any of it warrants
a revert. I commented on a few things which I was sure about or which
I'd like to see fixed in a follow up. I also let a bunch of minor stuff
like coding style nits slip since it doesn't seem worth the churn now
that it is in.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Xen-devel] [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 10:33   ` Ian Campbell
@ 2014-03-13 10:56     ` David Vrabel
  2014-03-13 11:02       ` Ian Campbell
  2014-03-13 13:17     ` Zoltan Kiss
  1 sibling, 1 reply; 36+ messages in thread
From: David Vrabel @ 2014-03-13 10:56 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Zoltan Kiss, xen-devel, jonathan.davies, wei.liu2, linux-kernel, netdev

On 13/03/14 10:33, Ian Campbell wrote:
> On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
>> @@ -135,13 +146,31 @@ struct xenvif {
>>  	pending_ring_idx_t pending_cons;
>>  	u16 pending_ring[MAX_PENDING_REQS];
>>  	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
>> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
>>  
>>  	/* Coalescing tx requests before copying makes number of grant
>>  	 * copy ops greater or equal to number of slots required. In
>>  	 * worst case a tx request consumes 2 gnttab_copy.
>>  	 */
>>  	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
>> -
>> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
>> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
> 
> I wonder if we should break some of these arrays into separate
> allocations? Wasn't there a problem with sizeof(struct xenvif) at one
> point?

alloc_netdev() falls back to vmalloc() if the kmalloc failed so there's
no need to split these structures.
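
As a rough illustration of that fallback pattern (sketch only;
alloc_priv() is a made-up helper, not the actual alloc_netdev()
implementation):

#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *alloc_priv(size_t size)
{
	/* Try physically contiguous memory first... */
	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

	/* ...and fall back to vmalloc space if that fails. */
	if (!p)
		p = vzalloc(size);
	return p;
}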

David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Xen-devel] [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 10:56     ` [Xen-devel] " David Vrabel
@ 2014-03-13 11:02       ` Ian Campbell
  2014-03-13 11:09         ` David Vrabel
  2014-03-13 11:13         ` Wei Liu
  0 siblings, 2 replies; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 11:02 UTC (permalink / raw)
  To: David Vrabel
  Cc: Zoltan Kiss, xen-devel, jonathan.davies, wei.liu2, linux-kernel, netdev

On Thu, 2014-03-13 at 10:56 +0000, David Vrabel wrote:
> On 13/03/14 10:33, Ian Campbell wrote:
> > On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> >> @@ -135,13 +146,31 @@ struct xenvif {
> >>  	pending_ring_idx_t pending_cons;
> >>  	u16 pending_ring[MAX_PENDING_REQS];
> >>  	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
> >> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
> >>  
> >>  	/* Coalescing tx requests before copying makes number of grant
> >>  	 * copy ops greater or equal to number of slots required. In
> >>  	 * worst case a tx request consumes 2 gnttab_copy.
> >>  	 */
> >>  	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
> >> -
> >> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
> >> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
> > 
> > I wonder if we should break some of these arrays into separate
> > allocations? Wasn't there a problem with sizeof(struct xenvif) at one
> > point?
> 
> alloc_netdev() falls back to vmalloc() if the kmalloc failed so there's
> no need to split these structures.

Is vmalloc space in abundant supply? For some reason I thought it was
limited (maybe that's a 32-bit only limitation?)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Xen-devel] [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 11:02       ` Ian Campbell
@ 2014-03-13 11:09         ` David Vrabel
  2014-03-13 11:13         ` Wei Liu
  1 sibling, 0 replies; 36+ messages in thread
From: David Vrabel @ 2014-03-13 11:09 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Zoltan Kiss, xen-devel, jonathan.davies, wei.liu2, linux-kernel, netdev

On 13/03/14 11:02, Ian Campbell wrote:
> On Thu, 2014-03-13 at 10:56 +0000, David Vrabel wrote:
>> On 13/03/14 10:33, Ian Campbell wrote:
>>> On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
>>>> @@ -135,13 +146,31 @@ struct xenvif {
>>>>  	pending_ring_idx_t pending_cons;
>>>>  	u16 pending_ring[MAX_PENDING_REQS];
>>>>  	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
>>>> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
>>>>  
>>>>  	/* Coalescing tx requests before copying makes number of grant
>>>>  	 * copy ops greater or equal to number of slots required. In
>>>>  	 * worst case a tx request consumes 2 gnttab_copy.
>>>>  	 */
>>>>  	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
>>>> -
>>>> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
>>>> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
>>>
>>> I wonder if we should break some of these arrays into separate
>>> allocations? Wasn't there a problem with sizeof(struct xenvif) at one
>>> point?
>>
>> alloc_netdev() falls back to vmalloc() if the kmalloc failed so there's
>> no need to split these structures.
> 
> Is vmalloc space in abundant supply? For some reason I thought it was
> limited (maybe that's a 32-bit only limitation?)

It is limited in 32-bit, but 64-bit has stupid amounts.

/proc/meminfo:

VmallocTotal:   34359738367 kB

David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Xen-devel] [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 11:02       ` Ian Campbell
  2014-03-13 11:09         ` David Vrabel
@ 2014-03-13 11:13         ` Wei Liu
  1 sibling, 0 replies; 36+ messages in thread
From: Wei Liu @ 2014-03-13 11:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: David Vrabel, Zoltan Kiss, xen-devel, jonathan.davies, wei.liu2,
	linux-kernel, netdev

On Thu, Mar 13, 2014 at 11:02:35AM +0000, Ian Campbell wrote:
> On Thu, 2014-03-13 at 10:56 +0000, David Vrabel wrote:
> > On 13/03/14 10:33, Ian Campbell wrote:
> > > On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> > >> @@ -135,13 +146,31 @@ struct xenvif {
> > >>  	pending_ring_idx_t pending_cons;
> > >>  	u16 pending_ring[MAX_PENDING_REQS];
> > >>  	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
> > >> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
> > >>  
> > >>  	/* Coalescing tx requests before copying makes number of grant
> > >>  	 * copy ops greater or equal to number of slots required. In
> > >>  	 * worst case a tx request consumes 2 gnttab_copy.
> > >>  	 */
> > >>  	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
> > >> -
> > >> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
> > >> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
> > > 
> > > I wonder if we should break some of these arrays into separate
> > > allocations? Wasn't there a problem with sizeof(struct xenvif) at one
> > > point?
> > 
> > alloc_netdev() falls back to vmalloc() if the kmalloc failed so there's
> > no need to split these structures.
> 
> Is vmalloc space in abundant supply? For some reason I thought it was
> limited (maybe that's a 32-bit only limitation?)

32-bit has a limitation of 128MB by default. 64-bit has much larger
address space.

Wei.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 10:17   ` Ian Campbell
@ 2014-03-13 12:34     ` Zoltan Kiss
  0 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-13 12:34 UTC (permalink / raw)
  To: Ian Campbell, Konrad Rzeszutek Wilk, David Vrabel
  Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On 13/03/14 10:17, Ian Campbell wrote:
> Pulling out this one comment for the attention on the core Xen/Linux
> maintainers.
>
> On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> [...]
>> @@ -343,8 +347,26 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>>   	vif->pending_prod = MAX_PENDING_REQS;
>>   	for (i = 0; i < MAX_PENDING_REQS; i++)
>>   		vif->pending_ring[i] = i;
>> -	for (i = 0; i < MAX_PENDING_REQS; i++)
>> -		vif->mmap_pages[i] = NULL;
>> +	spin_lock_init(&vif->callback_lock);
>> +	spin_lock_init(&vif->response_lock);
>> +	/* If ballooning is disabled, this will consume real memory, so you
>> +	 * better enable it. The long term solution would be to use just a
>> +	 * bunch of valid page descriptors, without dependency on ballooning
>> +	 */
>
> I wonder if we ought to enforce this via Kconfig? i.e. making
> CONFIG_XEN_BACKEND (or the individual backends) depend on BALLOON (or
> select?) or by making CONFIG_XEN_BALLOON non-optional etc.
I support this idea, but let's hear others' views.
>
> IIRC David V was looking into a solution involving auto hotplugging a
> new region to use for this case, but then I guess
> CONFIG_XEN_BALLOON_MEMORY_HOTPLUG would equally need to be enabled.
Yes, if that happens in the near future, we shouldn't be bothered by the 
above right now.

>
>> +	err = alloc_xenballooned_pages(MAX_PENDING_REQS,
>> +				       vif->mmap_pages,
>> +				       false);
> [...]
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 10:33   ` Ian Campbell
  2014-03-13 10:56     ` [Xen-devel] " David Vrabel
@ 2014-03-13 13:17     ` Zoltan Kiss
  2014-03-13 13:56       ` Ian Campbell
  1 sibling, 1 reply; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-13 13:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On 13/03/14 10:33, Ian Campbell wrote:
> On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
>> @@ -135,13 +146,31 @@ struct xenvif {
>>   	pending_ring_idx_t pending_cons;
>>   	u16 pending_ring[MAX_PENDING_REQS];
>>   	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
>> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
>>
>>   	/* Coalescing tx requests before copying makes number of grant
>>   	 * copy ops greater or equal to number of slots required. In
>>   	 * worst case a tx request consumes 2 gnttab_copy.
>>   	 */
>>   	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
>> -
>> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
>> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
>
> I wonder if we should break some of these arrays into separate
> allocations? Wasn't there a problem with sizeof(struct xenvif) at one
> point?
tx_copy_ops will be removed in the next patch. Yes, for grant_copy_op we 
allocate it separately, because it has MAX_SKB_FRAGS * 
XEN_NETIF_RX_RING_SIZE elements, but that's for the RX thread

>
>> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
>> index bc32627..1fe9fe5 100644
>> --- a/drivers/net/xen-netback/interface.c
>> +++ b/drivers/net/xen-netback/interface.c
>> @@ -493,6 +533,23 @@ void xenvif_disconnect(struct xenvif *vif)
>>
>>   void xenvif_free(struct xenvif *vif)
>>   {
>> +	int i, unmap_timeout = 0;
>> +
>> +	for (i = 0; i < MAX_PENDING_REQS; ++i) {
>> +		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
>> +			unmap_timeout++;
>> +			schedule_timeout(msecs_to_jiffies(1000));
>> +			if (unmap_timeout > 9 &&
>> +			    net_ratelimit())
>
> Does this really reach 80 columns when unwrapped?
Not here, but in a later patch the 9 will be replaced with 
worst_case_skb_lifetime.

>
> (there seems to my eye to be a lot of overaggressive wrapping in this
> patch, but nevermind)
I tried to fix every warning and error noticed by checkpatch.pl, however 
there are still a few lines longer than 80, just because I couldn't 
reasonably wrap them.
>
>> +				netdev_err(vif->dev,
>> +					   "Page still granted! Index: %x\n",
>> +					   i);
>> +			i = -1;
>
> Should there not be a break here? Otherwise don't we restart the for
> loop from 0 again? If that is intentional then a comment would be very
> useful.
Yes, that's intentional, we shouldn't exit this loop until everything is 
unmapped. An i-- would be fine as well. I will put a comment there.
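
To make that concrete, a sketch of how the loop could read with such a
comment (the comment wording here is hypothetical, not the final patch):

	for (i = 0; i < MAX_PENDING_REQS; ++i) {
		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
			unmap_timeout++;
			schedule_timeout(msecs_to_jiffies(1000));
			if (unmap_timeout > 9 && net_ratelimit())
				netdev_err(vif->dev,
					   "Page still granted! Index: %x\n",
					   i);
			/* Restart the scan from 0: we must not leave
			 * xenvif_free() until every handle has been
			 * unmapped, so keep rescanning until a full pass
			 * finds no valid handle.
			 */
			i = -1;
		}
	}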

>
>> @@ -919,11 +873,38 @@ err:
>>   	return NULL;
>>   }
>>
>> +static inline void xenvif_grant_handle_set(struct xenvif *vif,
>> +					   u16 pending_idx,
>> +					   grant_handle_t handle)
>> +{
>> +	if (unlikely(vif->grant_tx_handle[pending_idx] !=
>> +		     NETBACK_INVALID_HANDLE)) {
>> +		netdev_err(vif->dev,
>
> Is this in any way guest triggerable? Needs to be ratelimited in that
> case (and arguably even if not?)
It shouldn't be guest triggerable. It only means netback really screwed 
up the accounting of granted pages. There is a BUG right after it, and 
the kernel should panic here. David suggested replacing this whole 
thing with a BUG_ON. One counterargument is that there is a slight 
chance printing pending_idx can provide some useful info. At least back 
in the beginning when I tried to fix some basic mistakes it was useful.
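
For comparison, the two forms being weighed (a sketch only):

	/* (a) the terse form David suggests; loses the pending_idx */
	BUG_ON(vif->grant_tx_handle[pending_idx] != NETBACK_INVALID_HANDLE);

	/* (b) the current form: report the index, then panic anyway */
	if (unlikely(vif->grant_tx_handle[pending_idx] !=
		     NETBACK_INVALID_HANDLE)) {
		netdev_err(vif->dev,
			   "Trying to overwrite active handle! pending_idx: %x\n",
			   pending_idx);
		BUG();
	}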
>
>> +			   "Trying to overwrite active handle! pending_idx: %x\n",
>> +			   pending_idx);
>> +		BUG();
>> +	}
>> +	vif->grant_tx_handle[pending_idx] = handle;
>> +}
>> +
>> +static inline void xenvif_grant_handle_reset(struct xenvif *vif,
>> +					     u16 pending_idx)
>> +{
>> +	if (unlikely(vif->grant_tx_handle[pending_idx] ==
>> +		     NETBACK_INVALID_HANDLE)) {
>> +		netdev_err(vif->dev,
>
> Likewise.
>
>> +			   "Trying to unmap invalid handle! pending_idx: %x\n",
>> +			   pending_idx);
>> +		BUG();
>> +	}
>> +	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
>> +}
>> +
>> @@ -1001,6 +982,17 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
>>
>>   		pending_idx = frag_get_pending_idx(frag);
>>
>> +		/* If this is not the first frag, chain it to the previous*/
>> +		if (unlikely(prev_pending_idx == INVALID_PENDING_IDX))
>> +			skb_shinfo(skb)->destructor_arg =
>> +				&vif->pending_tx_info[pending_idx].callback_struct;
>> +		else if (likely(pending_idx != prev_pending_idx))
>> +			vif->pending_tx_info[prev_pending_idx].callback_struct.ctx =
>> +				&(vif->pending_tx_info[pending_idx].callback_struct);
>
> #define callback_for(vif, pending_idx) .... would make this and a bunch
> of other places a lot less verbose IMHO.
Yeah, I was thinking about that, but it's really used here and 2 places 
in tx_submit, so I didn't bother to do it.
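
A sketch of the helper being discussed (hypothetical, not part of the
posted series):

#define callback_for(vif, pending_idx) \
	(&(vif)->pending_tx_info[(pending_idx)].callback_struct)

	/* ...which would let the chaining above read roughly as: */
	if (unlikely(prev_pending_idx == INVALID_PENDING_IDX))
		skb_shinfo(skb)->destructor_arg =
			callback_for(vif, pending_idx);
	else if (likely(pending_idx != prev_pending_idx))
		callback_for(vif, prev_pending_idx)->ctx =
			callback_for(vif, pending_idx);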

>
>> +		index = pending_index(vif->pending_prod);
>> +		vif->pending_ring[index] = pending_idx;
>> +		/* TX shouldn't use the index before we give it back here */
>
> I hope this comment refers to the pending_prod++ and not the mb(), since
> the barrier only guarantees visibility after that point, but not
> invisibility before this point.
Yes, the NAPI instance will use vif->pending_ring[index] only after 
vif->pending_prod++, so the memory barrier makes sure that we set the 
element in the ring first and then increase the producer.
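
A minimal sketch of that ordering on the producer side (illustrative
only; the exact code in the patch may differ):

	index = pending_index(vif->pending_prod);
	vif->pending_ring[index] = pending_idx;	/* 1. write the slot */
	mb();		/* 2. order the slot write before the producer bump */
	vif->pending_prod++;	/* 3. NAPI only dereferences the slot once it
				 *    observes the incremented producer
				 */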

>
> [...]
>> +	/* Btw. already unmapped? */
>
> What does this comment mean? Is it a fixme? An indicator that
> xenvif_grant_handle_reset is supposed to handle this case or something
> else?
It comes from the time when xenvif_grant_handle_reset was not a 
standalone function. Yes, it refers to the check in the beginning of 
that function, and it should go there.
>
> I think there was another such comment earlier too.
>
>> +	xenvif_grant_handle_reset(vif, pending_idx);
>> +
>> +	ret = gnttab_unmap_refs(&tx_unmap_op, NULL,
>> +				&vif->mmap_pages[pending_idx], 1);
>> +	BUG_ON(ret);
>> +
>> +	xenvif_idx_release(vif, pending_idx, XEN_NETIF_RSP_OKAY);
>> +}
>> +
>>   static inline int rx_work_todo(struct xenvif *vif)
>>   {
>>   	return !skb_queue_empty(&vif->rx_queue) &&
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 13:17     ` Zoltan Kiss
@ 2014-03-13 13:56       ` Ian Campbell
  2014-03-13 17:43         ` Zoltan Kiss
  0 siblings, 1 reply; 36+ messages in thread
From: Ian Campbell @ 2014-03-13 13:56 UTC (permalink / raw)
  To: Zoltan Kiss; +Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On Thu, 2014-03-13 at 13:17 +0000, Zoltan Kiss wrote:
> On 13/03/14 10:33, Ian Campbell wrote:
> > On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
> >> @@ -135,13 +146,31 @@ struct xenvif {
> >>   	pending_ring_idx_t pending_cons;
> >>   	u16 pending_ring[MAX_PENDING_REQS];
> >>   	struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
> >> +	grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
> >>
> >>   	/* Coalescing tx requests before copying makes number of grant
> >>   	 * copy ops greater or equal to number of slots required. In
> >>   	 * worst case a tx request consumes 2 gnttab_copy.
> >>   	 */
> >>   	struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
> >> -
> >> +	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
> >> +	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
> >
> > I wonder if we should break some of these arrays into separate
> > allocations? Wasn't there a problem with sizeof(struct xenvif) at one
> > point?
> tx_copy_ops will be removed in the next patch. Yes, for grant_copy_op we 
> allocate it separately, because it has MAX_SKB_FRAGS * 
> XEN_NETIF_RX_RING_SIZE elements, but that's for the RX thread

OK

> >> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
> >> index bc32627..1fe9fe5 100644
> >> --- a/drivers/net/xen-netback/interface.c
> >> +++ b/drivers/net/xen-netback/interface.c
> >> @@ -493,6 +533,23 @@ void xenvif_disconnect(struct xenvif *vif)
> >>
> >>   void xenvif_free(struct xenvif *vif)
> >>   {
> >> +	int i, unmap_timeout = 0;
> >> +
> >> +	for (i = 0; i < MAX_PENDING_REQS; ++i) {
> >> +		if (vif->grant_tx_handle[i] != NETBACK_INVALID_HANDLE) {
> >> +			unmap_timeout++;
> >> +			schedule_timeout(msecs_to_jiffies(1000));
> >> +			if (unmap_timeout > 9 &&
> >> +			    net_ratelimit())
> >
> > Does this really reach 80 columns when unwrapped?
> Not here, but in a later patch 9 will be replaced with 
> worst_case_skb_lifetime.

OK.

> > (there seems to my eye to be a lot of overaggressive wrapping in this
> > patch, but nevermind)
> I tried to fix every warning and error noticed by checkpatch.pl, however 
> there are still a few lines longer than 80, just because I couldn't 
> reasonably wrap them.

I thought checkpatch had been reined in a bit wrt 80 columns because the
cure is often worse than the disease. Anyway, it's there now.

> >
> >> +				netdev_err(vif->dev,
> >> +					   "Page still granted! Index: %x\n",
> >> +					   i);
> >> +			i = -1;
> >
> > Should there not be a break here? Otherwise don't we restart the for
> > loop from 0 again? If that is intentional then a comment would be very
> > useful.
> Yes, that's intentional, we shouldn't exit this loop until everything is 
> unmapped. An i-- would be fine as well. I will put a comment there.

Yes please do, it's very non-obvious what is going on. I'm almost
inclined to suggest that this is one of the few places where a goto
retry might be appropriate.

Can you also add a comment saying what is doing the actual unmap work
which we are waiting for here since it is not actually part of the loop.
Might a barrier be needed to ensure we see that work happening?

> >
> >> @@ -919,11 +873,38 @@ err:
> >>   	return NULL;
> >>   }
> >>
> >> +static inline void xenvif_grant_handle_set(struct xenvif *vif,
> >> +					   u16 pending_idx,
> >> +					   grant_handle_t handle)
> >> +{
> >> +	if (unlikely(vif->grant_tx_handle[pending_idx] !=
> >> +		     NETBACK_INVALID_HANDLE)) {
> >> +		netdev_err(vif->dev,
> >
> > Is this in any way guest triggerable? Needs to be ratelimited in that
> > case (and arguably even if not?)
> It shouldn't be guest triggerable. It only means netback really screwed 
> up the accounting of granted pages. There is a BUG right after it, and 
> the kernel should panic here.

OK.

> >> +			   "Trying to unmap invalid handle! pending_idx: %x\n",
> >> +			   pending_idx);
> >> +		BUG();
> >> +	}
> >> +	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
> >> +}
> >> +
> >> @@ -1001,6 +982,17 @@ static void xenvif_fill_frags(struct xenvif *vif, struct sk_buff *skb)
> >>
> >>   		pending_idx = frag_get_pending_idx(frag);
> >>
> >> +		/* If this is not the first frag, chain it to the previous*/
> >> +		if (unlikely(prev_pending_idx == INVALID_PENDING_IDX))
> >> +			skb_shinfo(skb)->destructor_arg =
> >> +				&vif->pending_tx_info[pending_idx].callback_struct;
> >> +		else if (likely(pending_idx != prev_pending_idx))
> >> +			vif->pending_tx_info[prev_pending_idx].callback_struct.ctx =
> >> +				&(vif->pending_tx_info[pending_idx].callback_struct);
> >
> > #define callback_for(vif, pending_idx) .... would make this and a bunch
> > of other places a lot less verbose IMHO.
> Yeah, I was thinking about that, but it's really used here and 2 places 
> in tx_submit, so I didn't bother to do it.

That's still 5-6 uses I think. Worth it IMHO.

Thanks,

Ian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping
  2014-03-13 13:56       ` Ian Campbell
@ 2014-03-13 17:43         ` Zoltan Kiss
  0 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-13 17:43 UTC (permalink / raw)
  To: Ian Campbell; +Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies

On 13/03/14 13:56, Ian Campbell wrote:
> On Thu, 2014-03-13 at 13:17 +0000, Zoltan Kiss wrote:
>> On 13/03/14 10:33, Ian Campbell wrote:
>>> On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
>>>
>>>> +				netdev_err(vif->dev,
>>>> +					   "Page still granted! Index: %x\n",
>>>> +					   i);
>>>> +			i = -1;
>>>
>>> Should there not be a break here? Otherwise don't we restart the for
>>> loop from 0 again? If that is intentional then a comment would be very
>>> useful.
>> Yes, that's intentional, we shouldn't exit this loop until everything is
>> unmapped. An i-- would be fine as well. I will put a comment there.
>
> Yes please do, it's very non-obvious what is going on. I'm almost
> inclined to suggest that this is one of the few places where a goto
> retry might be appropriate.
>
> Can you also add a comment saying what is doing the actual unmap work
> which we are waiting for here since it is not actually part of the loop.
> Might a barrier be needed to ensure we see that work happening?
I don't think a barrier is necessary here; if this function runs into 
a handle that is not NETBACK_INVALID_HANDLE, it just starts the 
checking again.

On 13/03/14 13:17, Zoltan Kiss wrote:>>
 >> [...]
 >>> +    /* Btw. already unmapped? */
 >>
 >> What does this comment mean? Is it a fixme? An indicator that
 >> xenvif_grant_handle_reset is supposed to handle this case or something
 >> else?
 > It comes from the time when xenvif_grant_handle_reset was not a
 > standalone function. Yes, it refers to the check in the beginning of
 > that function, and it should go there.

I ended up removing that comment; the error message in the function 
says the same thing.

Zoli

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
  2014-03-13 10:08 ` Ian Campbell
@ 2014-03-13 18:23   ` Zoltan Kiss
  0 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-13 18:23 UTC (permalink / raw)
  To: Ian Campbell
  Cc: wei.liu2, xen-devel, netdev, linux-kernel, jonathan.davies,
	Marcus Granado

On 13/03/14 10:08, Ian Campbell wrote:
> On Thu, 2014-03-06 at 21:48 +0000, Zoltan Kiss wrote:
>> >A long known problem of the upstream netback implementation that on the TX
>> >path (from guest to Dom0) it copies the whole packet from guest memory into
>> >Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
>> >huge perfomance penalty. The classic kernel version of netback used grant
>> >mapping, and to get notified when the page can be unmapped, it used page
>> >destructors. Unfortunately that destructor is not an upstreamable solution.
>> >Ian Campbell's skb fragment destructor patch series [1] tried to solve this
>> >problem, however it seems to be very invasive on the network stack's code,
>> >and therefore haven't progressed very well.
>> >This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
>> >know when the skb is freed up. That is the way KVM solved the same problem,
>> >and based on my initial tests it can do the same for us. Avoiding the extra
>> >copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
>> >Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
>> >running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
>> >switch)
> Do you have any other numbers? e.g. for a modern Intel or AMD system? A
> slower box is likely to make the difference between copy and map larger,
> whereas modern Intel for example is supposed to be very good at copying.
The performance team made a lot of measurements; I've added Marcus to 
comment on that.
With the latest version and a tip net-next kernel I could see even ~9.3 
Gbps peak throughput on the same AMD box, which is the practical maximum 
for 10G cards. However, with older guests I couldn't reach that. A lot 
depends on netfront and the TCP stack, e.g. the tcp_limit_output_bytes 
sysctl can cause an artificial cap.
The perf team has 40 Gbps NICs now, I guess; it would be interesting to 
see how this performs there.
I just checked the intrahost guest-to-guest throughput with two upstream 
kernels; I could get 5.6-5.8 Gbps at most.

>
>> >Based on my investigations the packet get only copied if it is delivered to
>> >Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
>> >DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
>> >unfortunate, but luckily it doesn't cause a major regression for this usecase.
> Numbers?
I've checked that back in November:

https://lkml.org/lkml/2013/11/5/288

Originally it was 5.4, vs. 5.2 with my patch. I've checked DomU to 
Dom0 iperf again; it is still about the same with my series.

Zoli

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-06 21:48 ` [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations Zoltan Kiss
@ 2014-03-19 21:16   ` Zoltan Kiss
  2014-03-20  9:53     ` Paul Durrant
  2014-03-20 10:48     ` Wei Liu
  0 siblings, 2 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-19 21:16 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel
  Cc: netdev, linux-kernel, jonathan.davies, Paul Durrant

Hi,

I'm thinking about reverting this patch: its value is pretty small, but 
it causes a performance regression on Win7 guests. And it is probably 
not the best solution for this problem. It might be that the delay 
until the dealloc thread gets scheduled is already enough.
What do you think?

Zoli

On 06/03/14 21:48, Zoltan Kiss wrote:
> Unmapping causes TLB flushing, therefore we should make it in the largest
> possible batches. However we shouldn't starve the guest for too long. So if
> the guest has space for at least two big packets and we don't have at least a
> quarter ring to unmap, delay it for at most 1 milisec.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
> ---
> v4:
> - use bool for tx_dealloc_work_todo
>
> v6:
> - rebase tx_dealloc_work_todo due to missing ;
>
> diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
> index d1cd8ce..95498c8 100644
> --- a/drivers/net/xen-netback/common.h
> +++ b/drivers/net/xen-netback/common.h
> @@ -118,6 +118,8 @@ struct xenvif {
>   	u16 dealloc_ring[MAX_PENDING_REQS];
>   	struct task_struct *dealloc_task;
>   	wait_queue_head_t dealloc_wq;
> +	struct timer_list dealloc_delay;
> +	bool dealloc_delay_timed_out;
>
>   	/* Use kthread for guest RX */
>   	struct task_struct *task;
> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
> index 40aa500..f925af5 100644
> --- a/drivers/net/xen-netback/interface.c
> +++ b/drivers/net/xen-netback/interface.c
> @@ -407,6 +407,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>   			  .desc = i };
>   		vif->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
>   	}
> +	init_timer(&vif->dealloc_delay);
>
>   	/*
>   	 * Initialise a dummy MAC address. We choose the numerically
> @@ -557,6 +558,7 @@ void xenvif_disconnect(struct xenvif *vif)
>   	}
>
>   	if (vif->dealloc_task) {
> +		del_timer_sync(&vif->dealloc_delay);
>   		kthread_stop(vif->dealloc_task);
>   		vif->dealloc_task = NULL;
>   	}
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index bb65c7c..c098276 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -135,6 +135,11 @@ static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
>   		vif->pending_prod + vif->pending_cons;
>   }
>
> +static inline pending_ring_idx_t nr_free_slots(struct xen_netif_tx_back_ring *ring)
> +{
> +	return ring->nr_ents -	(ring->sring->req_prod - ring->rsp_prod_pvt);
> +}
> +
>   bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
>   {
>   	RING_IDX prod, cons;
> @@ -1932,9 +1937,36 @@ static inline int tx_work_todo(struct xenvif *vif)
>   	return 0;
>   }
>
> +static void xenvif_dealloc_delay(unsigned long data)
> +{
> +	struct xenvif *vif = (struct xenvif *)data;
> +
> +	vif->dealloc_delay_timed_out = true;
> +	wake_up(&vif->dealloc_wq);
> +}
> +
>   static inline bool tx_dealloc_work_todo(struct xenvif *vif)
>   {
> -	return vif->dealloc_cons != vif->dealloc_prod;
> +	if (vif->dealloc_cons != vif->dealloc_prod) {
> +		if ((nr_free_slots(&vif->tx) > 2 * XEN_NETBK_LEGACY_SLOTS_MAX) &&
> +		    (vif->dealloc_prod - vif->dealloc_cons < MAX_PENDING_REQS / 4) &&
> +		    !vif->dealloc_delay_timed_out) {
> +			if (!timer_pending(&vif->dealloc_delay)) {
> +				vif->dealloc_delay.function =
> +					xenvif_dealloc_delay;
> +				vif->dealloc_delay.data = (unsigned long)vif;
> +				mod_timer(&vif->dealloc_delay,
> +					  jiffies + msecs_to_jiffies(1));
> +
> +			}
> +			return false;
> +		}
> +		del_timer_sync(&vif->dealloc_delay);
> +		vif->dealloc_delay_timed_out = false;
> +		return true;
> +	}
> +
> +	return false;
>   }
>
>   void xenvif_unmap_frontend_rings(struct xenvif *vif)
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-19 21:16   ` Zoltan Kiss
@ 2014-03-20  9:53     ` Paul Durrant
  2014-03-20 10:48     ` Wei Liu
  1 sibling, 0 replies; 36+ messages in thread
From: Paul Durrant @ 2014-03-20  9:53 UTC (permalink / raw)
  To: Zoltan Kiss, Ian Campbell, Wei Liu, xen-devel
  Cc: netdev, linux-kernel, Jonathan Davies

> -----Original Message-----
> From: Zoltan Kiss
> Sent: 19 March 2014 21:16
> To: Ian Campbell; Wei Liu; xen-devel@lists.xenproject.org
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jonathan Davies;
> Paul Durrant
> Subject: Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap
> operations
> 
> Hi,
> 
> I'm thinking about revoking this patch: it's value is pretty small, but
> it causes performance regression on Win7 guests. And probably it is not
> the best solution for this problem. It might be the delay it takes the
> dealloc thread to be scheduled is enough.
> What do you think?
> 

Yes, I think we need a revert to fix the performance regression. As I understand things, it's sufficiently bad that we would not want to take the grant mapping series into XenServer without the reversion.

  Paul

> Zoli
> 
> On 06/03/14 21:48, Zoltan Kiss wrote:
> > Unmapping causes TLB flushing, therefore we should make it in the largest
> > possible batches. However we shouldn't starve the guest for too long. So if
> > the guest has space for at least two big packets and we don't have at least a
> > quarter ring to unmap, delay it for at most 1 milisec.
> >
> > Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
> > ---
> > v4:
> > - use bool for tx_dealloc_work_todo
> >
> > v6:
> > - rebase tx_dealloc_work_todo due to missing ;
> >
> > diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-
> netback/common.h
> > index d1cd8ce..95498c8 100644
> > --- a/drivers/net/xen-netback/common.h
> > +++ b/drivers/net/xen-netback/common.h
> > @@ -118,6 +118,8 @@ struct xenvif {
> >   	u16 dealloc_ring[MAX_PENDING_REQS];
> >   	struct task_struct *dealloc_task;
> >   	wait_queue_head_t dealloc_wq;
> > +	struct timer_list dealloc_delay;
> > +	bool dealloc_delay_timed_out;
> >
> >   	/* Use kthread for guest RX */
> >   	struct task_struct *task;
> > diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-
> netback/interface.c
> > index 40aa500..f925af5 100644
> > --- a/drivers/net/xen-netback/interface.c
> > +++ b/drivers/net/xen-netback/interface.c
> > @@ -407,6 +407,7 @@ struct xenvif *xenvif_alloc(struct device *parent,
> domid_t domid,
> >   			  .desc = i };
> >   		vif->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
> >   	}
> > +	init_timer(&vif->dealloc_delay);
> >
> >   	/*
> >   	 * Initialise a dummy MAC address. We choose the numerically
> > @@ -557,6 +558,7 @@ void xenvif_disconnect(struct xenvif *vif)
> >   	}
> >
> >   	if (vif->dealloc_task) {
> > +		del_timer_sync(&vif->dealloc_delay);
> >   		kthread_stop(vif->dealloc_task);
> >   		vif->dealloc_task = NULL;
> >   	}
> > diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-
> netback/netback.c
> > index bb65c7c..c098276 100644
> > --- a/drivers/net/xen-netback/netback.c
> > +++ b/drivers/net/xen-netback/netback.c
> > @@ -135,6 +135,11 @@ static inline pending_ring_idx_t
> nr_pending_reqs(struct xenvif *vif)
> >   		vif->pending_prod + vif->pending_cons;
> >   }
> >
> > +static inline pending_ring_idx_t nr_free_slots(struct
> xen_netif_tx_back_ring *ring)
> > +{
> > +	return ring->nr_ents -	(ring->sring->req_prod - ring-
> >rsp_prod_pvt);
> > +}
> > +
> >   bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
> >   {
> >   	RING_IDX prod, cons;
> > @@ -1932,9 +1937,36 @@ static inline int tx_work_todo(struct xenvif *vif)
> >   	return 0;
> >   }
> >
> > +static void xenvif_dealloc_delay(unsigned long data)
> > +{
> > +	struct xenvif *vif = (struct xenvif *)data;
> > +
> > +	vif->dealloc_delay_timed_out = true;
> > +	wake_up(&vif->dealloc_wq);
> > +}
> > +
> >   static inline bool tx_dealloc_work_todo(struct xenvif *vif)
> >   {
> > -	return vif->dealloc_cons != vif->dealloc_prod;
> > +	if (vif->dealloc_cons != vif->dealloc_prod) {
> > +		if ((nr_free_slots(&vif->tx) > 2 *
> XEN_NETBK_LEGACY_SLOTS_MAX) &&
> > +		    (vif->dealloc_prod - vif->dealloc_cons <
> MAX_PENDING_REQS / 4) &&
> > +		    !vif->dealloc_delay_timed_out) {
> > +			if (!timer_pending(&vif->dealloc_delay)) {
> > +				vif->dealloc_delay.function =
> > +					xenvif_dealloc_delay;
> > +				vif->dealloc_delay.data = (unsigned long)vif;
> > +				mod_timer(&vif->dealloc_delay,
> > +					  jiffies + msecs_to_jiffies(1));
> > +
> > +			}
> > +			return false;
> > +		}
> > +		del_timer_sync(&vif->dealloc_delay);
> > +		vif->dealloc_delay_timed_out = false;
> > +		return true;
> > +	}
> > +
> > +	return false;
> >   }
> >
> >   void xenvif_unmap_frontend_rings(struct xenvif *vif)
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-19 21:16   ` Zoltan Kiss
  2014-03-20  9:53     ` Paul Durrant
@ 2014-03-20 10:48     ` Wei Liu
  2014-03-20 11:14       ` Paul Durrant
  1 sibling, 1 reply; 36+ messages in thread
From: Wei Liu @ 2014-03-20 10:48 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies, Paul Durrant

On Wed, Mar 19, 2014 at 09:16:05PM +0000, Zoltan Kiss wrote:
> Hi,
> 
> I'm thinking about revoking this patch: it's value is pretty small,
> but it causes performance regression on Win7 guests. And probably it
> is not the best solution for this problem. It might be the delay it
> takes the dealloc thread to be scheduled is enough.
> What do you think?
> 

Can you elaborate? What makes Win7 so special? What's the performance
impact on other guests?

Wei.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-20 10:48     ` Wei Liu
@ 2014-03-20 11:14       ` Paul Durrant
  2014-03-20 12:38         ` Wei Liu
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Durrant @ 2014-03-20 11:14 UTC (permalink / raw)
  To: Wei Liu, Zoltan Kiss
  Cc: Ian Campbell, Wei Liu, xen-devel, netdev, linux-kernel, Jonathan Davies

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 20 March 2014 10:49
> To: Zoltan Kiss
> Cc: Ian Campbell; Wei Liu; xen-devel@lists.xenproject.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jonathan Davies;
> Paul Durrant
> Subject: Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap
> operations
> 
> On Wed, Mar 19, 2014 at 09:16:05PM +0000, Zoltan Kiss wrote:
> > Hi,
> >
> > I'm thinking about revoking this patch: it's value is pretty small,
> > but it causes performance regression on Win7 guests. And probably it
> > is not the best solution for this problem. It might be the delay it
> > takes the dealloc thread to be scheduled is enough.
> > What do you think?
> >
> 
> Can you elaborate? What makes Win7 so special? What's performance
> impact to other guests?
> 

It won't be Win7 specifically I expect. It will likely be any version of Windows, or any other OS that limits the TXs-in-flight so aggressively. Basically you need to TX-complete reasonably frequently, otherwise your throughput drops off a lot. IIRC at Solarflare we found every ~500us to be just about frequent enough for hitting 10G.

  Paul

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-20 11:14       ` Paul Durrant
@ 2014-03-20 12:38         ` Wei Liu
  2014-03-20 16:11           ` Zoltan Kiss
  0 siblings, 1 reply; 36+ messages in thread
From: Wei Liu @ 2014-03-20 12:38 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Wei Liu, Zoltan Kiss, Ian Campbell, xen-devel, netdev,
	linux-kernel, Jonathan Davies

On Thu, Mar 20, 2014 at 11:14:51AM +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > Sent: 20 March 2014 10:49
> > To: Zoltan Kiss
> > Cc: Ian Campbell; Wei Liu; xen-devel@lists.xenproject.org;
> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jonathan Davies;
> > Paul Durrant
> > Subject: Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap
> > operations
> > 
> > On Wed, Mar 19, 2014 at 09:16:05PM +0000, Zoltan Kiss wrote:
> > > Hi,
> > >
> > > I'm thinking about revoking this patch: it's value is pretty small,
> > > but it causes performance regression on Win7 guests. And probably it
> > > is not the best solution for this problem. It might be the delay it
> > > takes the dealloc thread to be scheduled is enough.
> > > What do you think?
> > >
> > 
> > Can you elaborate? What makes Win7 so special? What's performance
> > impact to other guests?
> > 
> 
> It won't be Win7 specifically I expect. It will likely by any version
> of Windows, or any other OS that limits the TXs-in-flight so
> aggressively. Basically you need to TX-complete reasonably frequently
> otherwise your throughput drops off a lot. IIRC at Solarflare we found
> every ~500us to be just about frequent enough for hitting 10G.

Thanks for the explanation.

Reverting this change basically means that when to flush the TLB is at
the sole discretion of the Linux kernel scheduler. I don't oppose that.
But it would be better to provide some numbers.

Wei.

> 
>   Paul

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations
  2014-03-20 12:38         ` Wei Liu
@ 2014-03-20 16:11           ` Zoltan Kiss
  0 siblings, 0 replies; 36+ messages in thread
From: Zoltan Kiss @ 2014-03-20 16:11 UTC (permalink / raw)
  To: Wei Liu, Paul Durrant, David Miller
  Cc: Ian Campbell, xen-devel, netdev, linux-kernel, Jonathan Davies

On 20/03/14 12:38, Wei Liu wrote:
> On Thu, Mar 20, 2014 at 11:14:51AM +0000, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Wei Liu [mailto:wei.liu2@citrix.com]
>>> Sent: 20 March 2014 10:49
>>> To: Zoltan Kiss
>>> Cc: Ian Campbell; Wei Liu; xen-devel@lists.xenproject.org;
>>> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jonathan Davies;
>>> Paul Durrant
>>> Subject: Re: [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap
>>> operations
>>>
>>> On Wed, Mar 19, 2014 at 09:16:05PM +0000, Zoltan Kiss wrote:
>>>> Hi,
>>>>
>>>> I'm thinking about revoking this patch: it's value is pretty small,
>>>> but it causes performance regression on Win7 guests. And probably it
>>>> is not the best solution for this problem. It might be the delay it
>>>> takes the dealloc thread to be scheduled is enough.
>>>> What do you think?
>>>>
>>>
>>> Can you elaborate? What makes Win7 so special? What's performance
>>> impact to other guests?
>>>
>>
>> It won't be Win7 specifically I expect. It will likely by any version
>> of Windows, or any other OS that limits the TXs-in-flight so
>> aggressively. Basically you need to TX-complete reasonably frequently
>> otherwise your throughput drops off a lot. IIRC at Solarflare we found
>> every ~500us to be just about frequent enough for hitting 10G.
>
> Thanks for the explanation.
>
> Reverting this change basically means when to flush TLB is at sole
> discretion of Linux kernel scheduler. I don't oppose to that. But it
> would be better to provide some numbers.

My comparisons with iperf haven't shown any significant difference; 
I've measured Win7 and upstream Linux guests.
There was a misunderstanding that reverting this patch would stop 
batching of unmaps. There would still be batching: when a callback 
wakes the dealloc thread, other callbacks can still place work on the 
dealloc ring by the time it runs, even while the thread is already 
processing it. And that can happen independently of the TX operations 
in the NAPI instance, which is another good feature of having a 
dealloc thread.

I've discussed this in person with Paul and Ian as well, and they are 
happy with the revert. So David, can you please revert e9275f5e2d 
"xen-netback: Aggregate TX unmap operations"?

Regards,

Zoli

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2014-03-20 16:11 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-06 21:48 [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 1/9] xen-netback: Use skb->cb for pending_idx Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 2/9] xen-netback: Minor refactoring of netback code Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 3/9] xen-netback: Handle foreign mapped pages on the guest RX path Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 4/9] xen-netback: Introduce TX grant mapping Zoltan Kiss
2014-03-13 10:17   ` Ian Campbell
2014-03-13 12:34     ` Zoltan Kiss
2014-03-13 10:33   ` Ian Campbell
2014-03-13 10:56     ` [Xen-devel] " David Vrabel
2014-03-13 11:02       ` Ian Campbell
2014-03-13 11:09         ` David Vrabel
2014-03-13 11:13         ` Wei Liu
2014-03-13 13:17     ` Zoltan Kiss
2014-03-13 13:56       ` Ian Campbell
2014-03-13 17:43         ` Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 5/9] xen-netback: Remove old TX grant copy definitons and fix indentations Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 6/9] xen-netback: Add stat counters for zerocopy Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 7/9] xen-netback: Handle guests with too many frags Zoltan Kiss
2014-03-06 21:48 ` [PATCH net-next v7 8/9] xen-netback: Timeout packets in RX path Zoltan Kiss
2014-03-13 10:39   ` Ian Campbell
2014-03-06 21:48 ` [PATCH net-next v7 9/9] xen-netback: Aggregate TX unmap operations Zoltan Kiss
2014-03-19 21:16   ` Zoltan Kiss
2014-03-20  9:53     ` Paul Durrant
2014-03-20 10:48     ` Wei Liu
2014-03-20 11:14       ` Paul Durrant
2014-03-20 12:38         ` Wei Liu
2014-03-20 16:11           ` Zoltan Kiss
2014-03-07 21:05 ` [PATCH net-next v7 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy David Miller
2014-03-08 14:37   ` Zoltan Kiss
2014-03-08 23:57     ` David Miller
2014-03-10 10:15       ` Wei Liu
2014-03-12 15:40       ` Ian Campbell
2014-03-12 18:49         ` Zoltan Kiss
2014-03-13 10:43         ` Ian Campbell
2014-03-13 10:08 ` Ian Campbell
2014-03-13 18:23   ` Zoltan Kiss
