* [PATCH v4 0/10] skb paged fragment destructors
@ 2012-04-10 14:26 Ian Campbell
  2012-04-10 14:26 ` [PATCH 01/10] net: add and use SKB_ALLOCSIZE Ian Campbell
                   ` (24 more replies)
  0 siblings, 25 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	David VomLehn, Bart Van Assche, xen-devel, Ian Campbell

I think this is v4, but I've sort of lost count; sorry that it's taken
me so long to get back to this stuff.

The following series makes use of the skb fragment API (which is in
3.2+) to add a per-paged-fragment destructor callback. This can be used
by creators of skbs who are interested in the lifecycle of the pages
included in that skb after they have handed it off to the network
stack.

The mail at [0] contains some more background and rationale, but
basically the completed series will allow entities which inject pages
into the networking stack to receive a notification when the stack has
really finished with those pages (i.e. including retransmissions,
clones, pull-ups etc) and not just when the original skb is finished
with. This is beneficial to many subsystems which wish to inject pages
into the network stack without giving up full ownership of those pages'
lifecycle. It implements something broadly along the lines of what was
described in [1].
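
To make this concrete, here is a minimal sketch of how an injector
might hook the callback using the API added in patch 06. The struct
embedding, my_frags_done() and complete_my_operation() are hypothetical
names; only skb_frag_destructor, skb_frag_set_destructor and
skb_fill_page_desc come from this series or the existing API:

	struct my_completion {
		struct skb_frag_destructor destructor;
		/* ... per-operation state ... */
	};

	static int my_frags_done(struct skb_frag_destructor *d)
	{
		struct my_completion *c =
			container_of(d, struct my_completion, destructor);

		/* The stack holds no further references to our page:
		 * it is now safe to complete the higher level
		 * operation and release the page ourselves. */
		complete_my_operation(c);	/* hypothetical */
		return 0;
	}

	/* At injection time, for a single frag (assumes one initial
	 * reference corresponding to that frag): */
	atomic_set(&c->destructor.ref, 1);
	c->destructor.destroy = my_frags_done;
	skb_fill_page_desc(skb, 0, page, 0, len);
	skb_frag_set_destructor(skb, 0, &c->destructor);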

I have also included a patch to the RPC subsystem which uses this API to
fix the bug which I describe at [2].

I've also had some interest from David VomLehn and Bart Van Assche
regarding using this functionality in the context of vmsplice and iSCSI
targets respectively (I think).

Changes since last time:

      * Added skb_orphan_frags API for the use of recipients of SKBs who
        may hold onto the SKB for a long time (this is analogous to
        skb_orphan; see the sketch after this list). This was pointed
        out by Michael. The TUN driver is currently the only user.
              * I can't for the life of me get anything to actually hit
                this code path. I've been trying with an NFS server
                running in a Xen HVM domain with emulated (e.g. tap)
                networking and a client in domain 0, using the NFS fix
                in this series which generates SKBs with destructors
                set, so far -- nothing. I suspect that lack of TSO/GSO
                etc on the TAP interface is causing the frags to be
                copied to normal pages during skb_segment().
      * Various fixups related to the change of alignment/padding in
        shinfo, in particular to build_skb as pointed out by Eric.
      * Tweaked ordering of shinfo members to ensure that all hotpath
        variables up to and including the first frag fit within (and are
        aligned to) a single 64 byte cache line. (Eric again)
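
As promised above, a minimal sketch of the intended skb_orphan_frags
calling pattern for a driver which may queue SKBs indefinitely (this
mirrors the TUN change in patch 08; example_xmit() and
example_queue_to_userspace() are hypothetical):

	static netdev_tx_t example_xmit(struct sk_buff *skb,
					struct net_device *dev)
	{
		/* Drop socket accounting state since we may hold the
		 * skb for an unbounded time (the same reasoning as
		 * skb_orphan itself). */
		skb_orphan(skb);

		/* Copy aside any paged frags which have a destructor
		 * so that their owner is not blocked on us; frags
		 * without destructors are left alone. */
		if (skb_orphan_frags(skb, GFP_KERNEL)) {
			dev_kfree_skb(skb);
			return NETDEV_TX_OK;
		}

		example_queue_to_userspace(skb);	/* hypothetical */
		return NETDEV_TX_OK;
	}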

I ran a single-threaded UDP benchmark (similar to that described by
Eric in e52fcb2462ac) and see no difference in pps throughput; it was
~810,000 pps both before and after.

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 01/10] net: add and use SKB_ALLOCSIZE
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 14:57   ` Eric Dumazet
  2012-04-10 14:57   ` Eric Dumazet
  2012-04-10 14:26 ` Ian Campbell
                   ` (23 subsequent siblings)
  24 siblings, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell

This gives the allocation size required for an skb containing X bytes of data.
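
For illustration, with SMP_CACHE_BYTES == 64 and a hypothetical
sizeof(struct skb_shared_info) == 320:

	SKB_ALLOCSIZE(1500) = SKB_DATA_ALIGN(1500) + SKB_DATA_ALIGN(320)
	                    = 1536 + 320
	                    = 1856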

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/ethernet/broadcom/bnx2.c        |    7 +++----
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |    3 +--
 drivers/net/ethernet/broadcom/tg3.c         |    3 +--
 include/linux/skbuff.h                      |   12 ++++++++++++
 net/core/skbuff.c                           |    8 +-------
 5 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 8297e28..dede71f 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -5321,8 +5321,7 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* 8 for CRC and VLAN */
 	rx_size = bp->dev->mtu + ETH_HLEN + BNX2_RX_OFFSET + 8;
 
-	rx_space = SKB_DATA_ALIGN(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD +
-		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	rx_space = SKB_ALLOCSIZE(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD;
 
 	bp->rx_copy_thresh = BNX2_RX_COPY_THRESH;
 	bp->rx_pg_ring_size = 0;
@@ -5345,8 +5344,8 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 
 	bp->rx_buf_use_size = rx_size;
 	/* hw alignment + build_skb() overhead*/
-	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
-		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	bp->rx_buf_size = SKB_ALLOCSIZE(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
+		NET_SKB_PAD;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, MAX_RX_RINGS);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index e37161f..12f2ceb 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1229,8 +1229,7 @@ struct bnx2x {
 #define BNX2X_FW_RX_ALIGN_START	(1UL << BNX2X_RX_ALIGN_SHIFT)
 
 #define BNX2X_FW_RX_ALIGN_END					\
-	max(1UL << BNX2X_RX_ALIGN_SHIFT, 			\
-	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+	max(1UL << BNX2X_RX_ALIGN_SHIFT, SKB_ALLOCSIZE(0))
 
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
 
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 7b71387..4d4b063 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -5672,8 +5672,7 @@ static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 	 * Callers depend upon this behavior and assume that
 	 * we leave everything unchanged if we fail.
 	 */
-	skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) +
-		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	skb_size = SKB_ALLOCSIZE(data_size + TG3_RX_OFFSET(tp));
 	data = kmalloc(skb_size, GFP_ATOMIC);
 	if (!data)
 		return -ENOMEM;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 192250b..fbc92b2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,8 +41,20 @@
 
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
+/* maximum data size which can fit into an allocation of X bytes */
 #define SKB_WITH_OVERHEAD(X)	\
 	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+/*
+ * minimum allocation size required for an skb containing X bytes of data
+ *
+ * We do our best to align skb_shared_info on a separate cache
+ * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
+ * aligned memory blocks, unless SLUB/SLAB debug is enabled.  Both
+ * skb->head and skb_shared_info are cache line aligned.
+ */
+#define SKB_ALLOCSIZE(X)	\
+	(SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a690cae..59a1ecb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -184,13 +184,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 	prefetchw(skb);
 
-	/* We do our best to align skb_shared_info on a separate cache
-	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
-	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
-	 * Both skb->head and skb_shared_info are cache line aligned.
-	 */
-	size = SKB_DATA_ALIGN(size);
-	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_ALLOCSIZE(size);
 	data = kmalloc_node_track_caller(size, gfp_mask, node);
 	if (!data)
 		goto nodata;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 02/10] net: Use SKB_WITH_OVERHEAD in build_skb
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
  2012-04-10 14:26 ` [PATCH 01/10] net: add and use SKB_ALLOCSIZE Ian Campbell
  2012-04-10 14:26 ` Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 14:58   ` Eric Dumazet
  2012-04-10 14:58   ` Eric Dumazet
  2012-04-10 14:26 ` Ian Campbell
                   ` (21 subsequent siblings)
  24 siblings, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 net/core/skbuff.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 59a1ecb..d4e139e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -264,7 +264,7 @@ struct sk_buff *build_skb(void *data)
 	if (!skb)
 		return NULL;
 
-	size = ksize(data) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_WITH_OVERHEAD(ksize(data));
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = SKB_TRUESIZE(size);
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 03/10] chelsio: use SKB_WITH_OVERHEAD
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (3 preceding siblings ...)
  2012-04-10 14:26 ` Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 14:59   ` Eric Dumazet
  2012-04-10 14:59   ` Eric Dumazet
  2012-04-10 14:26 ` Ian Campbell
                   ` (19 subsequent siblings)
  24 siblings, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell, Divy Le Ray

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb/sge.c  |    3 +--
 drivers/net/ethernet/chelsio/cxgb3/sge.c |    6 +++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c b/drivers/net/ethernet/chelsio/cxgb/sge.c
index 47a8435..52373db 100644
--- a/drivers/net/ethernet/chelsio/cxgb/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb/sge.c
@@ -599,8 +599,7 @@ static int alloc_rx_resources(struct sge *sge, struct sge_params *p)
 		sizeof(struct cpl_rx_data) +
 		sge->freelQ[!sge->jumbo_fl].dma_offset;
 
-		size = (16 * 1024) -
-		    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_WITH_OVERHEAD(16 * 1024);
 
 	sge->freelQ[sge->jumbo_fl].rx_buffer_size = size;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index cfb60e1..b804470 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -3043,7 +3043,7 @@ int t3_sge_alloc_qset(struct adapter *adapter, unsigned int id, int nports,
 	q->fl[1].buf_size = FL1_PG_CHUNK_SIZE;
 #else
 	q->fl[1].buf_size = is_offload(adapter) ?
-		(16 * 1024) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
+		SKB_WITH_OVERHEAD(16 * 1024) :
 		MAX_FRAME_SIZE + 2 + sizeof(struct cpl_rx_pkt);
 #endif
 
@@ -3282,8 +3282,8 @@ void t3_sge_prep(struct adapter *adap, struct sge_params *p)
 {
 	int i;
 
-	p->max_pkt_size = (16 * 1024) - sizeof(struct cpl_rx_data) -
-	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	p->max_pkt_size =
+		SKB_WITH_OVERHEAD((16*1024) - sizeof(struct cpl_rx_data));
 
 	for (i = 0; i < SGE_QSETS; ++i) {
 		struct qset_params *q = p->qset + i;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 04/10] net: pad skb data and shinfo as a whole rather than individually
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (5 preceding siblings ...)
  2012-04-10 14:26 ` Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 15:01   ` Eric Dumazet
  2012-04-10 15:01   ` Eric Dumazet
  2012-04-10 14:26 ` Ian Campbell
                   ` (17 subsequent siblings)
  24 siblings, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell

This reduces the minimum overhead required for this allocation such that the
shinfo can be grown in the following patch without overflowing 2048 bytes for a
1500 byte frame.

Reducing this overhead while also growing the shinfo means that sometimes the
tail end of the data can end up in the same cache line as the beginning of the
shinfo. Specifically, with 64 byte cache lines on a 64 bit system, the first 8
bytes of the shinfo can overlap the tail cache line of the data. In many cases
the allocation slop means that there is no overlap.
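
To see where the saving comes from, assume 64 byte cache lines and a
hypothetical sizeof(struct skb_shared_info) == 200:

	before: SKB_DATA_ALIGN(1500) + SKB_DATA_ALIGN(200) = 1536 + 256 = 1792
	after:  SKB_DATA_ALIGN(1500 + 200) = SKB_DATA_ALIGN(1700)      = 1728

i.e. rounding the data and shinfo up as a whole rather than
individually can save up to one cache line of padding per allocation.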

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fbc92b2..0ad6a46 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -43,17 +43,20 @@
 				 ~(SMP_CACHE_BYTES - 1))
 /* maximum data size which can fit into an allocation of X bytes */
 #define SKB_WITH_OVERHEAD(X)	\
-	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+	((X) - sizeof(struct skb_shared_info))
 /*
  * minimum allocation size required for an skb containing X bytes of data
  *
  * We do our best to align skb_shared_info on a separate cache
  * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
- * aligned memory blocks, unless SLUB/SLAB debug is enabled.  Both
- * skb->head and skb_shared_info are cache line aligned.
+ * aligned memory blocks, unless SLUB/SLAB debug is enabled.
+ * skb->head is aligned to a cache line while the tail of
+ * skb_shared_info is cache line aligned.  We arrange that the order
+ * of the fields in skb_shared_info is such that the interesting
+ * fields are cache line aligned and fit within a 64 byte cache line.
  */
 #define SKB_ALLOCSIZE(X)	\
-	(SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+	(SKB_DATA_ALIGN((X) + sizeof(struct skb_shared_info)))
 
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
@@ -63,7 +66,7 @@
 /* return minimum truesize of one skb containing X bytes of data */
 #define SKB_TRUESIZE(X) ((X) +						\
 			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
-			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+			 sizeof(struct skb_shared_info))
 
 /* A. Checksumming of received packets by device.
  *
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (8 preceding siblings ...)
  2012-04-10 14:26 ` [PATCH 05/10] net: move destructor_arg to the front of sk_buff Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 15:05   ` Eric Dumazet
                     ` (3 more replies)
  2012-04-10 14:26 ` [PATCH 06/10] net: add support for per-paged-fragment destructors Ian Campbell
                   ` (14 subsequent siblings)
  24 siblings, 4 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell

As of the previous patch we align the end (rather than the start) of the
struct to a cache line, and so, with 32 and 64 byte cache lines and the shinfo
size increase from the next patch, the first 8 bytes of the struct end up on a
different cache line to the rest of it. Make sure that leading field is
something relatively unimportant, to avoid hitting an extra cache line on hot
operations such as kfree_skb. Note that destructor_arg was not covered by the
old memset in __alloc_skb() (it sat after dataref) and remains uncleared in
its new position, which keeps the shinfo memset off that leading cache line.
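
The intended layout after this patch and the shinfo growth in the next
one, assuming 64 byte cache lines on a 64 bit system (a sketch; exact
offsets depend on the configuration):

	/* ... tail of packet data ....................... |
	 * destructor_arg  (cold; not touched by the       | cache line N
	 * __alloc_skb() memset)                           |
	 * ------------------------------------------------+
	 * nr_frags, tx_flags, gso_size, ..., dataref,     | cache line N+1
	 * frags[0]                                        | (hot)
	 */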

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   15 ++++++++++-----
 net/core/skbuff.c      |    5 ++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0ad6a46..f0ae39c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -265,6 +265,15 @@ struct ubuf_info {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
+
+	/*
+	 * Warning: all fields from here until dataref are cleared in
+	 * __alloc_skb()
+	 *
+	 */
 	unsigned char	nr_frags;
 	__u8		tx_flags;
 	unsigned short	gso_size;
@@ -276,14 +285,10 @@ struct skb_shared_info {
 	__be32          ip6_frag_id;
 
 	/*
-	 * Warning : all fields before dataref are cleared in __alloc_skb()
+	 * Warning: all fields before dataref are cleared in __alloc_skb()
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void *		destructor_arg;
-
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d4e139e..b8a41d6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -214,7 +214,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+
+	memset(&shinfo->nr_frags, 0,
+	       offsetof(struct skb_shared_info, dataref)
+	       - offsetof(struct skb_shared_info, nr_frags));
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 06/10] net: add support for per-paged-fragment destructors
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (10 preceding siblings ...)
  2012-04-10 14:26 ` [PATCH 06/10] net: add support for per-paged-fragment destructors Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-26 20:44   ` [Xen-devel] " Konrad Rzeszutek Wilk
  2012-04-26 20:44   ` Konrad Rzeszutek Wilk
  2012-04-10 14:26 ` [PATCH 07/10] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
                   ` (12 subsequent siblings)
  24 siblings, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell, Michał Mirosław

Entities which care about the complete lifecycle of pages which they inject
into the network stack via an skb paged fragment can choose to set this
destructor in order to receive a callback when the stack is really finished
with a page (including all clones, retransmits, pull-ups etc etc).

This destructor will always be propagated alongside the struct page when
copying skb_frag_t->page. This is the reason I chose to embed the destructor in
a "struct { } page" within the skb_frag_t, rather than as a separate field,
since it allows existing code which propagates ->frags[N].page to Just
Work(tm).

When the destructor is present the page reference counting is done slightly
differently. No references are held by the network stack on the struct page
(it is up to the caller to manage this as necessary); instead the network
stack will track references via the count embedded in the destructor
structure. When this reference count reaches zero the destructor will be
called and the caller can take the necessary steps to release the page (i.e.
release the struct page reference itself).

The intention is that callers can use this callback to delay completion to
_their_ callers until the network stack has completely released the page, in
order to prevent use-after-free or modification of data pages which are still
in use by the stack.

It is allowable (indeed expected) for a caller to share a single destructor
instance between multiple pages injected into the stack, e.g. a group of pages
included in a single higher level operation might share a destructor which is
used to complete that higher level operation.
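
A hedged sketch of that sharing scheme (my_op_done() and the fields of
struct my_op are hypothetical; it assumes one initial reference per
frag, matching the per-frag drop performed via __skb_frag_unref()):

	/* One destructor covering all nr_pages frags of one
	 * higher level operation. */
	atomic_set(&op->destructor.ref, op->nr_pages);
	op->destructor.destroy = my_op_done;

	for (i = 0; i < op->nr_pages; i++) {
		skb_fill_page_desc(skb, i, op->pages[i], 0, PAGE_SIZE);
		skb_frag_set_destructor(skb, i, &op->destructor);
	}

my_op_done() then fires exactly once, when the stack drops its last
reference to any of the frags.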

With this change and the previous two changes to shinfo alignment and field
ordering it is now the case that on a 64 bit system with 64 byte cache lines,
everything from nr_frags until the end of frags[0] is on the same cache line.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   43 +++++++++++++++++++++++++++++++++++++++++++
 net/core/skbuff.c      |   17 +++++++++++++++++
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f0ae39c..6ac283e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -166,9 +166,15 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(struct skb_frag_destructor *destructor);
+};
+
 struct skb_frag_struct {
 	struct {
 		struct page *p;
+		struct skb_frag_destructor *destructor;
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
@@ -1221,6 +1227,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 }
 
 /**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+					   struct skb_frag_destructor *destroy)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+	frag->page.destructor = destroy;
+}
+
+/**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
@@ -1239,6 +1270,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page.p		  = page;
+	frag->page.destructor     = NULL;
 	frag->page_offset	  = off;
 	skb_frag_size_set(frag, size);
 }
@@ -1743,6 +1775,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
 	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
  * @frag: the paged fragment
@@ -1751,6 +1786,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_ref(frag->page.destructor);
+		return;
+	}
 	get_page(skb_frag_page(frag));
 }
 
@@ -1774,6 +1813,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
  */
 static inline void __skb_frag_unref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_unref(frag->page.destructor);
+		return;
+	}
 	put_page(skb_frag_page(frag));
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b8a41d6..9ec88ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -349,6 +349,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
 }
 EXPORT_SYMBOL(dev_alloc_skb);
 
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+	BUG_ON(destroy == NULL);
+	atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+	if (destroy == NULL)
+		return;
+
+	if (atomic_dec_and_test(&destroy->ref))
+		destroy->destroy(destroy);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 07/10] net: only allow paged fragments with the same destructor to be coalesced.
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (12 preceding siblings ...)
  2012-04-10 14:26 ` [PATCH 07/10] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 20:11   ` Ben Hutchings
  2012-04-10 20:11   ` Ben Hutchings
  2012-04-10 14:26 ` [PATCH 08/10] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell
                   ` (10 subsequent siblings)
  24 siblings, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Michał Mirosław

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |    7 +++++--
 net/core/skbuff.c      |    1 +
 net/ipv4/ip_output.c   |    2 +-
 net/ipv4/tcp.c         |    4 ++--
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6ac283e..8593ac2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2014,13 +2014,16 @@ static inline int skb_add_data(struct sk_buff *skb,
 }
 
 static inline int skb_can_coalesce(struct sk_buff *skb, int i,
-				   const struct page *page, int off)
+				   const struct page *page,
+				   const struct skb_frag_destructor *destroy,
+				   int off)
 {
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
 		return page == skb_frag_page(frag) &&
-		       off == frag->page_offset + skb_frag_size(frag);
+		       off == frag->page_offset + skb_frag_size(frag) &&
+		       frag->page.destructor == destroy;
 	}
 	return 0;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9ec88ce..e63a4a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2323,6 +2323,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 */
 	if (!to ||
 	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+			      fragfrom->page.destructor,
 			      fragfrom->page_offset)) {
 		merge = -1;
 	} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ff302bd..9e4eca6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1243,7 +1243,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, offset)) {
+		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
 			get_page(page);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cfd7edd..b1612e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -804,7 +804,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -1013,7 +1013,7 @@ new_segment:
 
 				off = sk->sk_sndmsg_off;
 
-				if (skb_can_coalesce(skb, i, page, off) &&
+				if (skb_can_coalesce(skb, i, page, NULL, off) &&
 				    off != PAGE_SIZE) {
 					/* We can extend the last page
 					 * fragment. */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 08/10] net: add skb_orphan_frags to copy aside frags with destructors
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (13 preceding siblings ...)
  2012-04-10 14:26 ` Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
  2012-04-10 14:26 ` Ian Campbell
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell

This should be used by drivers which need to hold on to an skb for an
extended (perhaps unbounded) period of time, e.g. the tun driver, which
relies on userspace consuming the skb.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: mst@redhat.com
---
 drivers/net/tun.c      |    1 +
 include/linux/skbuff.h |   11 ++++++++
 net/core/skbuff.c      |   67 ++++++++++++++++++++++++++++++++++-------------
 3 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 74d7f76..b20789e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -416,6 +416,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Orphan the skb - required as we might hang on to it
 	 * for indefinite time. */
 	skb_orphan(skb);
+	skb_orphan_frags(skb, GFP_KERNEL);
 
 	/* Enqueue packet */
 	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8593ac2..77145f0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1688,6 +1688,17 @@ static inline void skb_orphan(struct sk_buff *skb)
 }
 
 /**
+ *	skb_orphan_frags - orphan the frags contained in a buffer
+ *	@skb: buffer to orphan frags from
+ *	@gfp_mask: allocation mask for replacement pages
+ *
+ *	For each frag in the SKB which has a destructor (i.e. has an
+ *	owner) create a copy of that frag and release the original
+ *	page by calling the destructor.
+ */
+extern int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask);
+
+/**
  *	__skb_queue_purge - empty a list
  *	@list: list to empty
  *
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e63a4a6..d0a1603 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -688,31 +688,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-/*	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
- *	@skb: the skb to modify
- *	@gfp_mask: allocation priority
- *
- *	This must be called on SKBTX_DEV_ZEROCOPY skb.
- *	It will copy all frags into kernel and drop the reference
- *	to userspace pages.
- *
- *	If this function is called from an interrupt gfp_mask() must be
- *	%GFP_ATOMIC.
- *
- *	Returns 0 on success or a negative error code on failure
- *	to allocate kernel memory to copy to.
+/*
+ * If uarg != NULL copy and replace all frags.
+ * If uarg == NULL then only copy and replace those which have a destructor
+ * pointer.
  */
-int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask,
+			  struct ubuf_info *uarg)
 {
 	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
+		if (!uarg && !f->page.destructor)
+			continue;
+
 		page = alloc_page(GFP_ATOMIC);
 		if (!page) {
 			while (head) {
@@ -730,11 +724,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		head = page;
 	}
 
-	/* skb frags release userspace buffers */
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+	/* skb frags release buffers */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+		if (!uarg && !f->page.destructor)
+			continue;
 		skb_frag_unref(skb, i);
+	}
 
-	uarg->callback(uarg);
+	if (uarg)
+		uarg->callback(uarg);
 
 	/* skb frags point to kernel buffers */
 	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
@@ -743,10 +742,40 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		head = (struct page *)head->private;
 	}
 
-	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
 	return 0;
 }
 
+/*	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
+ *	@skb: the skb to modify
+ *	@gfp_mask: allocation priority
+ *
+ *	This must be called on SKBTX_DEV_ZEROCOPY skb.
+ *	It will copy all frags into kernel and drop the reference
+ *	to userspace pages.
+ *
+ *	If this function is called from an interrupt gfp_mask() must be
+ *	%GFP_ATOMIC.
+ *
+ *	Returns 0 on success or a negative error code on failure
+ *	to allocate kernel memory to copy to.
+ */
+int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+	int rc;
+
+	rc = skb_copy_frags(skb, gfp_mask, uarg);
+
+	if (rc == 0)
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+
+	return rc;
+}
+
+int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	return skb_copy_frags(skb, gfp_mask, NULL);
+}
 
 /**
  *	skb_clone	-	duplicate an sk_buff
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 09/10] net: add paged frag destructor support to kernel_sendpage.
@ 2012-04-10 14:26     ` Ian Campbell
  0 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Trond Myklebust, Greg Kroah-Hartman, drbd-user, devel,
	cluster-devel, ocfs2-devel, ceph-devel, rds-devel, linux-nfs

This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.
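
For illustration only (nothing in this patch passes a destructor yet): a
caller which wants to know when the stack, including any clones and
retransmissions, is really finished with a page would pass a destructor
instead of NULL. The sketch below assumes the skb_frag_destructor
introduced earlier in this series (a reference count plus a destroy
callback); "my_ctx" and its fields are invented.

	struct my_ctx {
		struct skb_frag_destructor destructor;
		struct page *page;
		struct completion done;
	};

	static int my_page_done(struct skb_frag_destructor *destroy)
	{
		struct my_ctx *ctx =
			container_of(destroy, struct my_ctx, destructor);

		/* Every reference, clone and retransmit has let go. */
		complete(&ctx->done);
		return 0;
	}

	/* ...initialise ctx->destructor with my_page_done plus one
	 * reference, then: */
	ret = kernel_sendpage(sock, ctx->page, &ctx->destructor,
			      offset, size, flags);
	wait_for_completion(&ctx->done);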

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: drbd-user@lists.linbit.com
Cc: devel@driverdev.osuosl.org
Cc: cluster-devel@redhat.com
Cc: ocfs2-devel@oss.oracle.com
Cc: netdev@vger.kernel.org
Cc: ceph-devel@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-nfs@vger.kernel.org
---
 drivers/block/drbd/drbd_main.c           |    1 +
 drivers/scsi/iscsi_tcp.c                 |    4 ++--
 drivers/scsi/iscsi_tcp.h                 |    3 ++-
 drivers/target/iscsi/iscsi_target_util.c |    3 ++-
 fs/dlm/lowcomms.c                        |    4 ++--
 fs/ocfs2/cluster/tcp.c                   |    1 +
 include/linux/net.h                      |    6 +++++-
 include/net/inet_common.h                |    4 +++-
 include/net/ip.h                         |    4 +++-
 include/net/sock.h                       |    8 +++++---
 include/net/tcp.h                        |    4 +++-
 net/ceph/messenger.c                     |    2 +-
 net/core/sock.c                          |    6 +++++-
 net/ipv4/af_inet.c                       |    9 ++++++---
 net/ipv4/ip_output.c                     |    6 ++++--
 net/ipv4/tcp.c                           |   24 +++++++++++++++---------
 net/ipv4/udp.c                           |   11 ++++++-----
 net/ipv4/udp_impl.h                      |    5 +++--
 net/rds/tcp_send.c                       |    1 +
 net/socket.c                             |   11 +++++++----
 net/sunrpc/svcsock.c                     |    6 +++---
 net/sunrpc/xprtsock.c                    |    2 +-
 22 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 211fc44..e70ba0c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
 	set_fs(KERNEL_DS);
 	do {
 		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+							NULL,
 							offset, len,
 							msg_flags);
 		if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 453a740..df9f7dd 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
 		if (!segment->data) {
 			sg = segment->sg;
 			offset += segment->sg_offset + sg->offset;
-			r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
-						  copy, flags);
+			r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+						  offset, copy, flags);
 		} else {
 			struct msghdr msg = { .msg_flags = flags };
 			struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
 	uint32_t		sendpage_failures_cnt;
 	uint32_t		discontiguous_hdr_cnt;
 
-	ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+	ssize_t (*sendpage)(struct socket *, struct page *,
+			    struct skb_frag_destructor *, int, size_t, int);
 };
 
 struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 4eba86d..d876dae 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1323,7 +1323,8 @@ send_hdr:
 		u32 sub_len = min_t(u32, data_len, space);
 send_pg:
 		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+					sg_page(sg), NULL,
+					sg->offset + offset, sub_len, 0);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tcp_sendpage() returned"
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..0673cea 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1336,8 +1336,8 @@ static void send_to_sock(struct connection *con)
 
 		ret = 0;
 		if (len) {
-			ret = kernel_sendpage(con->sock, e->page, offset, len,
-					      msg_flags);
+			ret = kernel_sendpage(con->sock, e->page, NULL,
+					      offset, len, msg_flags);
 			if (ret == -EAGAIN || ret == 0) {
 				if (ret == -EAGAIN &&
 				    test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) &&
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 044e7b5..e13851e 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -983,6 +983,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		mutex_lock(&sc->sc_send_lock);
 		ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
 						 virt_to_page(kmalloced_virt),
+						 NULL,
 						 (long)kmalloced_virt & ~PAGE_MASK,
 						 size, MSG_DONTWAIT);
 		mutex_unlock(&sc->sc_send_lock);
diff --git a/include/linux/net.h b/include/linux/net.h
index be60c7f..d9b0d648 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct skb_frag_destructor;
 
 struct proto_ops {
 	int		family;
@@ -203,6 +204,7 @@ struct proto_ops {
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
+				      struct skb_frag_destructor *destroy,
 				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
@@ -274,7 +276,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
 			     char *optval, int *optlen);
 extern int kernel_setsockopt(struct socket *sock, int level, int optname,
 			     char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+			   struct skb_frag_destructor *destroy,
+			   int offset,
 			   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..91cd8d0 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
 extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size);
-extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+extern ssize_t inet_sendpage(struct socket *sock, struct page *page,
+			     struct skb_frag_destructor *frag,
+			     int offset,
 			     size_t size, int flags);
 extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size, int flags);
diff --git a/include/net/ip.h b/include/net/ip.h
index b53d65f..6bf9926 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -114,7 +114,9 @@ extern int		ip_append_data(struct sock *sk, struct flowi4 *fl4,
 				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
-extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4,
+				struct page *page,
+				struct skb_frag_destructor *destroy,
 				int offset, size_t size, int flags);
 extern struct sk_buff  *__ip_make_skb(struct sock *sk,
 				      struct flowi4 *fl4,
diff --git a/include/net/sock.h b/include/net/sock.h
index a6ba1f8..c927997 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -834,6 +834,7 @@ struct proto {
 					size_t len, int noblock, int flags, 
 					int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
+					struct skb_frag_destructor *destroy,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
 					struct sockaddr *uaddr, int addr_len);
@@ -1452,9 +1453,10 @@ extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
 extern ssize_t			sock_no_sendpage(struct socket *sock,
-						struct page *page,
-						int offset, size_t size, 
-						int flags);
+					struct page *page,
+					struct skb_frag_destructor *destroy,
+					int offset, size_t size,
+					int flags);
 
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f75a04d..7536266 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -331,7 +331,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk);
 extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t size);
-extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+extern int tcp_sendpage(struct sock *sk, struct page *page,
+			struct skb_frag_destructor *destroy,
+			int offset,
 			size_t size, int flags);
 extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ad5b708..69f049b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -851,7 +851,7 @@ static int write_partial_msg_pages(struct ceph_connection *con)
 				cpu_to_le32(crc32c(tmpcrc, base, len));
 			con->out_msg_pos.did_page_crc = 1;
 		}
-		ret = kernel_sendpage(con->sock, page,
+		ret = kernel_sendpage(con->sock, page, NULL,
 				      con->out_msg_pos.page_pos + page_shift,
 				      len,
 				      MSG_DONTWAIT | MSG_NOSIGNAL |
diff --git a/net/core/sock.c b/net/core/sock.c
index 9be6d0d..f56fc8c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1965,7 +1965,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *
 }
 EXPORT_SYMBOL(sock_no_mmap);
 
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
+ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
+			 struct skb_frag_destructor *destroy,
+			 int offset, size_t size, int flags)
 {
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
@@ -1975,6 +1977,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
 	iov.iov_len = size;
 	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
 	kunmap(page);
+	/* kernel_sendmsg copies so we can destroy immediately */
+	skb_frag_destructor_unref(destroy);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index fdf49fd..e55a6e1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -748,7 +748,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage(struct socket *sock, struct page *page,
+		      struct skb_frag_destructor *destroy,
+		      int offset,
 		      size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
@@ -761,8 +763,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 		return -EAGAIN;
 
 	if (sk->sk_prot->sendpage)
-		return sk->sk_prot->sendpage(sk, page, offset, size, flags);
-	return sock_no_sendpage(sock, page, offset, size, flags);
+		return sk->sk_prot->sendpage(sk, page, destroy,
+					     offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(inet_sendpage);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9e4eca6..2ce0b8e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1130,6 +1130,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 }
 
 ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+		       struct skb_frag_destructor *destroy,
 		       int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -1243,11 +1244,12 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
+		if (skb_can_coalesce(skb, i, page, destroy, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, len);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		} else {
 			err = -EMSGSIZE;
 			goto error;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b1612e9..89d4db0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -757,8 +757,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
-			 size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk,
+				struct page **pages,
+				struct skb_frag_destructor *destroy,
+				int poffset,
+				size_t psize, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int mss_now, size_goal;
@@ -804,7 +807,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -815,8 +818,9 @@ new_segment:
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 		} else {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		}
 
 		skb->len += copy;
@@ -871,18 +875,20 @@ out_err:
 	return sk_stream_error(sk, flags, err);
 }
 
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int tcp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	ssize_t res;
 
 	if (!(sk->sk_route_caps & NETIF_F_SG) ||
 	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-		return sock_no_sendpage(sk->sk_socket, page, offset, size,
-					flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, &page, destroy,
+			       offset, size, flags);
 	release_sock(sk);
 	return res;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d6f5fee..f9038e4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1032,8 +1032,9 @@ do_confirm:
 }
 EXPORT_SYMBOL(udp_sendmsg);
 
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int udp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct udp_sock *up = udp_sk(sk);
@@ -1061,11 +1062,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	}
 
 	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
+			     page, destroy, offset, size, flags);
 	if (ret == -EOPNOTSUPP) {
 		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 	}
 	if (ret < 0) {
 		udp_flush_pending_frames(sk);
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index aaad650..4923d82 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,8 +23,9 @@ extern int	compat_udp_getsockopt(struct sock *sk, int level, int optname,
 #endif
 extern int	udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    size_t len, int noblock, int flags, int *addr_len);
-extern int	udp_sendpage(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags);
+extern int	udp_sendpage(struct sock *sk, struct page *page,
+			     struct skb_frag_destructor *destroy,
+			     int offset, size_t size, int flags);
 extern int	udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb);
 extern void	udp_destroy_sock(struct sock *sk);
 
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 1b4fd68..71503ad 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	while (sg < rm->data.op_nents) {
 		ret = tc->t_sock->ops->sendpage(tc->t_sock,
 						sg_page(&rm->data.op_sg[sg]),
+						NULL,
 						rm->data.op_sg[sg].offset + off,
 						rm->data.op_sg[sg].length - off,
 						MSG_DONTWAIT|MSG_NOSIGNAL);
diff --git a/net/socket.c b/net/socket.c
index 12a48d8..d0c0d8d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -815,7 +815,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	if (more)
 		flags |= MSG_MORE;
 
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return kernel_sendpage(sock, page, NULL, offset, size, flags);
 }
 
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3350,15 +3350,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(kernel_setsockopt);
 
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+		    struct skb_frag_destructor *destroy,
+		    int offset,
 		    size_t size, int flags)
 {
 	sock_update_classid(sock->sk);
 
 	if (sock->ops->sendpage)
-		return sock->ops->sendpage(sock, page, offset, size, flags);
+		return sock->ops->sendpage(sock, page, destroy,
+					   offset, size, flags);
 
-	return sock_no_sendpage(sock, page, offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(kernel_sendpage);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 40ae884..706305b 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -185,7 +185,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	/* send head */
 	if (slen == xdr->head[0].iov_len)
 		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
+	len = kernel_sendpage(sock, headpage, NULL, headoffset,
 				  xdr->head[0].iov_len, flags);
 	if (len != xdr->head[0].iov_len)
 		goto out;
@@ -198,7 +198,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
@@ -212,7 +212,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 
 	/* send tail */
 	if (xdr->tail[0].iov_len) {
-		result = kernel_sendpage(sock, tailpage, tailoffset,
+		result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
 				   xdr->tail[0].iov_len, 0);
 		if (result > 0)
 			len += result;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 92bc518..f05082b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [Ocfs2-devel] [PATCH 09/10] net: add paged frag destructor support to kernel_sendpage.
@ 2012-04-10 14:26     ` Ian Campbell
  0 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Trond Myklebust, Greg Kroah-Hartman, drbd-user, devel,
	cluster-devel, ocfs2-devel, ceph-devel, rds-devel, linux-nfs

This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: drbd-user@lists.linbit.com
Cc: devel@driverdev.osuosl.org
Cc: cluster-devel@redhat.com
Cc: ocfs2-devel@oss.oracle.com
Cc: netdev@vger.kernel.org
Cc: ceph-devel@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-nfs@vger.kernel.org
---
 drivers/block/drbd/drbd_main.c           |    1 +
 drivers/scsi/iscsi_tcp.c                 |    4 ++--
 drivers/scsi/iscsi_tcp.h                 |    3 ++-
 drivers/target/iscsi/iscsi_target_util.c |    3 ++-
 fs/dlm/lowcomms.c                        |    4 ++--
 fs/ocfs2/cluster/tcp.c                   |    1 +
 include/linux/net.h                      |    6 +++++-
 include/net/inet_common.h                |    4 +++-
 include/net/ip.h                         |    4 +++-
 include/net/sock.h                       |    8 +++++---
 include/net/tcp.h                        |    4 +++-
 net/ceph/messenger.c                     |    2 +-
 net/core/sock.c                          |    6 +++++-
 net/ipv4/af_inet.c                       |    9 ++++++---
 net/ipv4/ip_output.c                     |    6 ++++--
 net/ipv4/tcp.c                           |   24 +++++++++++++++---------
 net/ipv4/udp.c                           |   11 ++++++-----
 net/ipv4/udp_impl.h                      |    5 +++--
 net/rds/tcp_send.c                       |    1 +
 net/socket.c                             |   11 +++++++----
 net/sunrpc/svcsock.c                     |    6 +++---
 net/sunrpc/xprtsock.c                    |    2 +-
 22 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 211fc44..e70ba0c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
 	set_fs(KERNEL_DS);
 	do {
 		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+							NULL,
 							offset, len,
 							msg_flags);
 		if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 453a740..df9f7dd 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
 		if (!segment->data) {
 			sg = segment->sg;
 			offset += segment->sg_offset + sg->offset;
-			r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
-						  copy, flags);
+			r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+						  offset, copy, flags);
 		} else {
 			struct msghdr msg = { .msg_flags = flags };
 			struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
 	uint32_t		sendpage_failures_cnt;
 	uint32_t		discontiguous_hdr_cnt;
 
-	ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+	ssize_t (*sendpage)(struct socket *, struct page *,
+			    struct skb_frag_destructor *, int, size_t, int);
 };
 
 struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 4eba86d..d876dae 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1323,7 +1323,8 @@ send_hdr:
 		u32 sub_len = min_t(u32, data_len, space);
 send_pg:
 		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+					sg_page(sg), NULL,
+					sg->offset + offset, sub_len, 0);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tcp_sendpage() returned"
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..0673cea 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1336,8 +1336,8 @@ static void send_to_sock(struct connection *con)
 
 		ret = 0;
 		if (len) {
-			ret = kernel_sendpage(con->sock, e->page, offset, len,
-					      msg_flags);
+			ret = kernel_sendpage(con->sock, e->page, NULL,
+					      offset, len, msg_flags);
 			if (ret == -EAGAIN || ret == 0) {
 				if (ret == -EAGAIN &&
 				    test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) &&
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 044e7b5..e13851e 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -983,6 +983,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		mutex_lock(&sc->sc_send_lock);
 		ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
 						 virt_to_page(kmalloced_virt),
+						 NULL,
 						 (long)kmalloced_virt & ~PAGE_MASK,
 						 size, MSG_DONTWAIT);
 		mutex_unlock(&sc->sc_send_lock);
diff --git a/include/linux/net.h b/include/linux/net.h
index be60c7f..d9b0d648 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct skb_frag_destructor;
 
 struct proto_ops {
 	int		family;
@@ -203,6 +204,7 @@ struct proto_ops {
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
+				      struct skb_frag_destructor *destroy,
 				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
@@ -274,7 +276,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
 			     char *optval, int *optlen);
 extern int kernel_setsockopt(struct socket *sock, int level, int optname,
 			     char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+			   struct skb_frag_destructor *destroy,
+			   int offset,
 			   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..91cd8d0 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
 extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size);
-extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+extern ssize_t inet_sendpage(struct socket *sock, struct page *page,
+			     struct skb_frag_destructor *frag,
+			     int offset,
 			     size_t size, int flags);
 extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size, int flags);
diff --git a/include/net/ip.h b/include/net/ip.h
index b53d65f..6bf9926 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -114,7 +114,9 @@ extern int		ip_append_data(struct sock *sk, struct flowi4 *fl4,
 				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
-extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4,
+				struct page *page,
+				struct skb_frag_destructor *destroy,
 				int offset, size_t size, int flags);
 extern struct sk_buff  *__ip_make_skb(struct sock *sk,
 				      struct flowi4 *fl4,
diff --git a/include/net/sock.h b/include/net/sock.h
index a6ba1f8..c927997 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -834,6 +834,7 @@ struct proto {
 					size_t len, int noblock, int flags, 
 					int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
+					struct skb_frag_destructor *destroy,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
 					struct sockaddr *uaddr, int addr_len);
@@ -1452,9 +1453,10 @@ extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
 extern ssize_t			sock_no_sendpage(struct socket *sock,
-						struct page *page,
-						int offset, size_t size, 
-						int flags);
+					struct page *page,
+					struct skb_frag_destructor *destroy,
+					int offset, size_t size,
+					int flags);
 
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f75a04d..7536266 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -331,7 +331,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk);
 extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t size);
-extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+extern int tcp_sendpage(struct sock *sk, struct page *page,
+			struct skb_frag_destructor *destroy,
+			int offset,
 			size_t size, int flags);
 extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ad5b708..69f049b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -851,7 +851,7 @@ static int write_partial_msg_pages(struct ceph_connection *con)
 				cpu_to_le32(crc32c(tmpcrc, base, len));
 			con->out_msg_pos.did_page_crc = 1;
 		}
-		ret = kernel_sendpage(con->sock, page,
+		ret = kernel_sendpage(con->sock, page, NULL,
 				      con->out_msg_pos.page_pos + page_shift,
 				      len,
 				      MSG_DONTWAIT | MSG_NOSIGNAL |
diff --git a/net/core/sock.c b/net/core/sock.c
index 9be6d0d..f56fc8c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1965,7 +1965,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *
 }
 EXPORT_SYMBOL(sock_no_mmap);
 
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
+ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
+			 struct skb_frag_destructor *destroy,
+			 int offset, size_t size, int flags)
 {
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
@@ -1975,6 +1977,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
 	iov.iov_len = size;
 	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
 	kunmap(page);
+	/* kernel_sendmsg copies so we can destroy immediately */
+	skb_frag_destructor_unref(destroy);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index fdf49fd..e55a6e1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -748,7 +748,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage(struct socket *sock, struct page *page,
+		      struct skb_frag_destructor *destroy,
+		      int offset,
 		      size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
@@ -761,8 +763,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 		return -EAGAIN;
 
 	if (sk->sk_prot->sendpage)
-		return sk->sk_prot->sendpage(sk, page, offset, size, flags);
-	return sock_no_sendpage(sock, page, offset, size, flags);
+		return sk->sk_prot->sendpage(sk, page, destroy,
+					     offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(inet_sendpage);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9e4eca6..2ce0b8e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1130,6 +1130,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 }
 
 ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+		       struct skb_frag_destructor *destroy,
 		       int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -1243,11 +1244,12 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
+		if (skb_can_coalesce(skb, i, page, destroy, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, len);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		} else {
 			err = -EMSGSIZE;
 			goto error;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b1612e9..89d4db0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -757,8 +757,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
-			 size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk,
+				struct page **pages,
+				struct skb_frag_destructor *destroy,
+				int poffset,
+				size_t psize, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int mss_now, size_goal;
@@ -804,7 +807,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -815,8 +818,9 @@ new_segment:
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 		} else {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		}
 
 		skb->len += copy;
@@ -871,18 +875,20 @@ out_err:
 	return sk_stream_error(sk, flags, err);
 }
 
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int tcp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	ssize_t res;
 
 	if (!(sk->sk_route_caps & NETIF_F_SG) ||
 	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-		return sock_no_sendpage(sk->sk_socket, page, offset, size,
-					flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, &page, destroy,
+			       offset, size, flags);
 	release_sock(sk);
 	return res;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d6f5fee..f9038e4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1032,8 +1032,9 @@ do_confirm:
 }
 EXPORT_SYMBOL(udp_sendmsg);
 
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int udp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct udp_sock *up = udp_sk(sk);
@@ -1061,11 +1062,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	}
 
 	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
+			     page, destroy, offset, size, flags);
 	if (ret == -EOPNOTSUPP) {
 		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 	}
 	if (ret < 0) {
 		udp_flush_pending_frames(sk);
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index aaad650..4923d82 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,8 +23,9 @@ extern int	compat_udp_getsockopt(struct sock *sk, int level, int optname,
 #endif
 extern int	udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    size_t len, int noblock, int flags, int *addr_len);
-extern int	udp_sendpage(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags);
+extern int	udp_sendpage(struct sock *sk, struct page *page,
+			     struct skb_frag_destructor *destroy,
+			     int offset, size_t size, int flags);
 extern int	udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb);
 extern void	udp_destroy_sock(struct sock *sk);
 
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 1b4fd68..71503ad 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	while (sg < rm->data.op_nents) {
 		ret = tc->t_sock->ops->sendpage(tc->t_sock,
 						sg_page(&rm->data.op_sg[sg]),
+						NULL,
 						rm->data.op_sg[sg].offset + off,
 						rm->data.op_sg[sg].length - off,
 						MSG_DONTWAIT|MSG_NOSIGNAL);
diff --git a/net/socket.c b/net/socket.c
index 12a48d8..d0c0d8d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -815,7 +815,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	if (more)
 		flags |= MSG_MORE;
 
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return kernel_sendpage(sock, page, NULL, offset, size, flags);
 }
 
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3350,15 +3350,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(kernel_setsockopt);
 
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+		    struct skb_frag_destructor *destroy,
+		    int offset,
 		    size_t size, int flags)
 {
 	sock_update_classid(sock->sk);
 
 	if (sock->ops->sendpage)
-		return sock->ops->sendpage(sock, page, offset, size, flags);
+		return sock->ops->sendpage(sock, page, destroy,
+					   offset, size, flags);
 
-	return sock_no_sendpage(sock, page, offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(kernel_sendpage);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 40ae884..706305b 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -185,7 +185,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	/* send head */
 	if (slen == xdr->head[0].iov_len)
 		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
+	len = kernel_sendpage(sock, headpage, NULL, headoffset,
 				  xdr->head[0].iov_len, flags);
 	if (len != xdr->head[0].iov_len)
 		goto out;
@@ -198,7 +198,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
@@ -212,7 +212,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 
 	/* send tail */
 	if (xdr->tail[0].iov_len) {
-		result = kernel_sendpage(sock, tailpage, tailoffset,
+		result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
 				   xdr->tail[0].iov_len, 0);
 		if (result > 0)
 			len += result;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 92bc518..f05082b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread
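
[For readers wanting to try the new hook, here is a minimal sketch of an
in-kernel sender that passes a destructor through kernel_sendpage() and
blocks until the network stack has genuinely released the page. Everything
here (my_tx, my_pages_released, my_send) is hypothetical, not part of the
series, and the reference-counting convention -- an initial ref of 1 that
the owner drops after queueing -- is assumed from the usage in patch 10
rather than taken from a finalised API.]

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/net.h>
#include <linux/skbuff.h>
#include <linux/socket.h>

struct my_tx {
	struct skb_frag_destructor destructor;
	struct completion done;
};

/* Runs once the stack has dropped every reference to the frag, i.e.
 * after any clones and retransmissions have been freed too. */
static int my_pages_released(struct skb_frag_destructor *destroy)
{
	struct my_tx *tx = container_of(destroy, struct my_tx, destructor);

	complete(&tx->done);	/* the page may be reused/freed now */
	return 0;
}

static int my_send(struct socket *sock, struct page *page,
		   int offset, size_t len)
{
	struct my_tx tx;
	int ret;

	init_completion(&tx.done);
	atomic_set(&tx.destructor.ref, 1);	/* owner's reference */
	tx.destructor.destroy = my_pages_released;

	ret = kernel_sendpage(sock, page, &tx.destructor,
			      offset, len, 0);

	/* Drop the owner's reference; the callback fires once the
	 * stack has dropped all of its per-frag references as well. */
	skb_frag_destructor_unref(&tx.destructor);
	wait_for_completion(&tx.done);
	return ret;
}

[Note that the copying fallback, sock_no_sendpage(), drops a reference
itself (see the hunk above), so a real caller would need to settle on one
convention; treat this as illustrative only.]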

* [PATCH 10/10] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
@ 2012-04-10 14:26     ` Ian Campbell
  0 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Ian Campbell, Neil Brown, J. Bruce Fields, linux-nfs

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    3 ++-
 net/sunrpc/xprt.c           |   12 ++++++++++++
 net/sunrpc/xprtsock.c       |    3 ++-
 6 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index af70af3..ff1b121 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 77d278d..e8d3f18 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 7a4cb5f..4e94e2a 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -62,6 +62,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1417,6 +1418,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1582,6 +1585,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1615,8 +1619,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1689,7 +1693,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1729,7 +1734,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
@@ -1907,12 +1911,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1927,6 +1933,17 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	dprintk("RPC: %5u call_complete result %d\n",
+		task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 706305b..efa95df 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor,
+					 base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 0cbcd1a..a252759 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1108,6 +1108,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(struct skb_frag_destructor *destroy)
+{
+	struct rpc_rqst	*req =
+		container_of(destroy, struct rpc_rqst, destructor);
+
+	dprintk("RPC: %5u completing skb pages\n", req->rq_task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, req->rq_task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1120,6 +1130,8 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index f05082b..b6ee8b7 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+					  base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 71+ messages in thread
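
[To make the change easier to follow: one destructor can cover a whole
multi-page send, which is essentially what xs_send_pagedata() now does
with xdr->destructor. The helper below is a hypothetical sketch -- it
ignores partial sends for brevity, and again assumes the refcount
convention used by this patch (each page attached as an skb frag takes
its own reference via skb_frag_ref(); the owner's initial reference is
dropped once everything is queued).]

#include <linux/mm.h>
#include <linux/net.h>
#include <linux/skbuff.h>
#include <linux/socket.h>

/* Queue every page of a buffer against a single destructor.  The
 * ->destroy() callback only runs once the stack has released all of
 * the pages, including any clones or retransmissions. */
static int send_buffer(struct socket *sock, struct page **pages,
		       int npages, struct skb_frag_destructor *destroy)
{
	int i, ret = 0;

	for (i = 0; i < npages; i++) {
		/* partial sends ignored for brevity */
		ret = kernel_sendpage(sock, pages[i], destroy, 0,
				      PAGE_SIZE,
				      i < npages - 1 ? MSG_MORE : 0);
		if (ret < 0)
			break;
	}

	/* Drop the owner's initial reference (cf. call_transmit()
	 * calling skb_frag_destructor_unref() in the patch above). */
	skb_frag_destructor_unref(destroy);
	return ret;
}

[In the NFS WRITE case this is what closes the race described in the
commit message: the rpc_task now sleeps in call_complete() until
xprt_complete_skb_pages() wakes it, so write() cannot return to userspace
while a queued retransmission still references its pages.]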

* [PATCH 10/10] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (16 preceding siblings ...)
  2012-04-10 14:26 ` [PATCH 09/10] net: add paged frag destructor support to kernel_sendpage Ian Campbell
@ 2012-04-10 14:26 ` Ian Campbell
       [not found] ` <1334067965.5394.22.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 14:26 UTC (permalink / raw)
  To: netdev
  Cc: linux-nfs, Wei Liu, Ian Campbell, Eric Dumazet,
	Michael S. Tsirkin, Neil Brown, xen-devel, J. Bruce Fields,
	David Miller

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    3 ++-
 net/sunrpc/xprt.c           |   12 ++++++++++++
 net/sunrpc/xprtsock.c       |    3 ++-
 6 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index af70af3..ff1b121 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 77d278d..e8d3f18 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 7a4cb5f..4e94e2a 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -62,6 +62,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1417,6 +1418,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1582,6 +1585,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1615,8 +1619,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1689,7 +1693,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1729,7 +1734,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
@@ -1907,12 +1911,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1927,6 +1933,17 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	dprintk("RPC: %5u call_complete result %d\n",
+		task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 706305b..efa95df 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor,
+					 base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 0cbcd1a..a252759 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1108,6 +1108,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(struct skb_frag_destructor *destroy)
+{
+	struct rpc_rqst	*req =
+		container_of(destroy, struct rpc_rqst, destructor);
+
+	dprintk("RPC: %5u completing skb pages\n", req->rq_task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, req->rq_task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1120,6 +1130,8 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index f05082b..b6ee8b7 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+					  base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 71+ messages in thread
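
To make the lifecycle in the patch above easier to follow, here is a
minimal sketch of the intended refcount flow (the field names are taken
from the patch itself; the semantics of sendpage() taking a reference
per paged fragment are assumed from the core patches of this series):

	/* Set up at request-init time: RPC holds one reference of its own. */
	atomic_set(&req->destructor.ref, 1);
	req->destructor.destroy = &xprt_complete_skb_pages;
	req->rq_snd_buf.destructor = &req->destructor;

	/* The transmit paths (xs_send_pagedata, svc_send_common) hand
	 * xdr->destructor to sendpage(), which is assumed to take one
	 * reference per paged fragment it attaches to an skb. */

	/* When the RPC layer is otherwise done it drops its own reference
	 * and waits; releasing the last fragment fires ->destroy(), and
	 * xprt_complete_skb_pages() wakes the task so call_complete() can
	 * finally run rpc_exit_task: */
	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
	skb_frag_destructor_unref(&req->destructor);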

* Re: [PATCH 01/10] net: add and use SKB_ALLOCSIZE
  2012-04-10 14:26 ` [PATCH 01/10] net: add and use SKB_ALLOCSIZE Ian Campbell
  2012-04-10 14:57   ` Eric Dumazet
@ 2012-04-10 14:57   ` Eric Dumazet
  1 sibling, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2012-04-10 14:57 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Michael S. Tsirkin, Wei Liu, xen-devel

On Tue, 2012-04-10 at 15:26 +0100, Ian Campbell wrote:
> This gives the allocation size required for an skb containing X bytes of data
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2.c        |    7 +++----
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |    3 +--
>  drivers/net/ethernet/broadcom/tg3.c         |    3 +--
>  include/linux/skbuff.h                      |   12 ++++++++++++
>  net/core/skbuff.c                           |    8 +-------
>  5 files changed, 18 insertions(+), 15 deletions(-)
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread
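
(Patch 01 itself is not quoted in this reply; for readers of the
archive, a plausible shape for the new macro follows, hedged since the
authoritative definition lives in the patch:)

	/* Assumed shape of SKB_ALLOCSIZE -- illustrative only: the aligned
	 * data area plus the aligned shared info that follows it. */
	#define SKB_ALLOCSIZE(X) \
		(SKB_DATA_ALIGN(X) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))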

* Re: [PATCH 02/10] net: Use SKB_WITH_OVERHEAD in build_skb
  2012-04-10 14:26 ` [PATCH 02/10] net: Use SKB_WITH_OVERHEAD in build_skb Ian Campbell
  2012-04-10 14:58   ` Eric Dumazet
@ 2012-04-10 14:58   ` Eric Dumazet
  1 sibling, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2012-04-10 14:58 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Michael S. Tsirkin, Wei Liu, xen-devel

On Tue, 2012-04-10 at 15:26 +0100, Ian Campbell wrote:
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  net/core/skbuff.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 59a1ecb..d4e139e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -264,7 +264,7 @@ struct sk_buff *build_skb(void *data)
>  	if (!skb)
>  		return NULL;
>  
> -	size = ksize(data) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	size = SKB_WITH_OVERHEAD(ksize(data));
>  
>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>  	skb->truesize = SKB_TRUESIZE(size);

Well, why not ;)

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread
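
For reference, SKB_WITH_OVERHEAD already exists in include/linux/skbuff.h:

	#define SKB_WITH_OVERHEAD(X)	\
		((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

so the hunk above is a straight substitution with no change in behaviour.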

* Re: [PATCH v4 0/10] skb paged fragment destructors
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (19 preceding siblings ...)
  2012-04-10 14:58 ` [PATCH v4 0/10] skb paged fragment destructors Michael S. Tsirkin
@ 2012-04-10 14:58 ` Michael S. Tsirkin
  2012-04-10 15:00 ` Michael S. Tsirkin
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-04-10 14:58 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Eric Dumazet, Wei Liu, David VomLehn,
	Bart Van Assche, xen-devel

On Tue, Apr 10, 2012 at 03:26:05PM +0100, Ian Campbell wrote:
>               * I can't for the life of me get anything to actually hit
>                 this code path. I've been trying with an NFS server
>                 running in a Xen HVM domain with emulated (e.g. tap)
>                 networking and a client in domain 0, using the NFS fix
>                 in this series which generates SKBs with destructors
>                 set, so far -- nothing. I suspect that lack of TSO/GSO
>                 etc on the TAP interface is causing the frags to be
>                 copied to normal pages during skb_segment().

To enable GSO you need to call TUNSETOFFLOAD.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread
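
For anyone reproducing Ian's test setup, a minimal userspace sketch of
this suggestion (tap_fd is assumed to be an already-configured tap
device file descriptor; the exact flag set is illustrative):

	#include <linux/if_tun.h>
	#include <stdio.h>
	#include <sys/ioctl.h>

	/* Enable checksum and TSO offloads on the tap fd so GSO skbs keep
	 * their paged frags all the way to the device instead of having
	 * them copied to normal pages in skb_segment(). */
	static int tap_enable_gso(int tap_fd)
	{
		unsigned int offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;

		if (ioctl(tap_fd, TUNSETOFFLOAD, offload) < 0) {
			perror("TUNSETOFFLOAD");
			return -1;
		}
		return 0;
	}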

* Re: [PATCH 03/10] chelsio: use SKB_WITH_OVERHEAD
  2012-04-10 14:26 ` [PATCH 03/10] chelsio: use SKB_WITH_OVERHEAD Ian Campbell
@ 2012-04-10 14:59   ` Eric Dumazet
  2012-04-10 14:59   ` Eric Dumazet
  1 sibling, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2012-04-10 14:59 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Michael S. Tsirkin, Wei Liu, xen-devel,
	Divy Le Ray

On Tue, 2012-04-10 at 15:26 +0100, Ian Campbell wrote:
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Divy Le Ray <divy@chelsio.com>
> ---
>  drivers/net/ethernet/chelsio/cxgb/sge.c  |    3 +--
>  drivers/net/ethernet/chelsio/cxgb3/sge.c |    6 +++---
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 0/10] skb paged fragment destructors
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (21 preceding siblings ...)
  2012-04-10 15:00 ` Michael S. Tsirkin
@ 2012-04-10 15:00 ` Michael S. Tsirkin
  2012-04-10 15:46 ` Bart Van Assche
  2012-04-10 15:46 ` Bart Van Assche
  24 siblings, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-04-10 15:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Eric Dumazet, Wei Liu, David VomLehn,
	Bart Van Assche, xen-devel

On Tue, Apr 10, 2012 at 03:26:05PM +0100, Ian Campbell wrote:
> I think this is v4, but I've sort of lost count, sorry that it's taken
> me so long to get back to this stuff.
> 
> The following series makes use of the skb fragment API (which is in 3.2
> +) to add a per-paged-fragment destructor callback. This can be used by
> creators of skbs who are interested in the lifecycle of the pages
> included in that skb after they have handed it off to the network stack.
> 
> The mail at [0] contains some more background and rationale but
> basically the completed series will allow entities which inject pages
> into the networking stack to receive a notification when the stack has
> really finished with those pages (i.e. including retransmissions,
> clones, pull-ups etc) and not just when the original skb is finished
> with, which is beneficial to many subsystems which wish to inject pages
> into the network stack without giving up full ownership of those page's
> lifecycle. It implements something broadly along the lines of what was
> described in [1].
> 
> I have also included a patch to the RPC subsystem which uses this API to
> fix the bug which I describe at [2].
> 
> I've also had some interest from David VemLehn and Bart Van Assche
> regarding using this functionality in the context of vmsplice and iSCSI
> targets respectively (I think).
> 
> Changes since last time:
> 
>       * Added skb_orphan_frags API for the use of recipients of SKBs who
>         may hold onto the SKB for a long time (this is analogous to
>         skb_orphan). This was pointed out by Michael. The TUN driver is
>         currently the only user.
>               * I can't for the life of me get anything to actually hit
>                 this code path. I've been trying with an NFS server
>                 running in a Xen HVM domain with emulated (e.g. tap)
>                 networking and a client in domain 0, using the NFS fix
>                 in this series which generates SKBs with destructors
>                 set, so far -- nothing. I suspect that lack of TSO/GSO
>                 etc on the TAP interface is causing the frags to be
>                 copied to normal pages during skb_segment().

Will take a look tomorrow, thanks!

>       * Various fixups related to the change of alignment/padding in
>         shinfo, in particular to build_skb as pointed out by Eric.
>       * Tweaked ordering of shinfo members to ensure that all hotpath
>         variables up to and including the first frag fit within (and are
>         aligned to) a single 64 byte cache line. (Eric again)
> 
> I ran a monothread UDP benchmark (similar to that described by Eric in
> e52fcb2462ac) and don't see any difference in pps throughput, it was
> ~810,000 pps both before and after.
> 
> Cheers,
> Ian.
> 
> [0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
> [1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
> [2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 04/10] net: pad skb data and shinfo as a whole rather than individually
  2012-04-10 14:26 ` [PATCH 04/10] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
  2012-04-10 15:01   ` Eric Dumazet
@ 2012-04-10 15:01   ` Eric Dumazet
  1 sibling, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2012-04-10 15:01 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Michael S. Tsirkin, Wei Liu, xen-devel

On Tue, 2012-04-10 at 15:26 +0100, Ian Campbell wrote:
> This reduces the minimum overhead required for this allocation such that the
> shinfo can be grown in the following patch without overflowing 2048 bytes for a
> 1500 byte frame.
> 
> Reducing this overhead while also growing the shinfo means that sometimes the
> tail end of the data can end up in the same cache line as the beginning of the
> shinfo. Specifically in the case of the 64 byte cache lines on a 64 bit system
> the first 8 bytes of shinfo can overlap the tail cacheline of the data. In many
> cases the allocation slop means that there is no overlap.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> ---

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread
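
To make the commit message concrete, a sketch contrasting the two
padding schemes (hypothetical helpers for illustration; the real change
is in the patch):

	/* Old scheme: data and shinfo are each rounded up to the cache line
	 * size individually, wasting up to SMP_CACHE_BYTES - 1 bytes of
	 * slop twice. */
	static inline unsigned int alloc_old(unsigned int size)
	{
		return SKB_DATA_ALIGN(size) +
		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	}

	/* New scheme: pad the pair as a whole, so a grown shinfo plus a
	 * 1500-byte frame still fits the 2048-byte kmalloc slab, at the
	 * cost of the data tail occasionally sharing a cache line with
	 * the start of the shinfo. */
	static inline unsigned int alloc_new(unsigned int size)
	{
		return SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
	}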

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 14:26 ` Ian Campbell
@ 2012-04-10 15:05   ` Eric Dumazet
  2012-04-10 15:19     ` Ian Campbell
  2012-04-10 15:19     ` Ian Campbell
  2012-04-10 15:05   ` Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 71+ messages in thread
From: Eric Dumazet @ 2012-04-10 15:05 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Michael S. Tsirkin, Wei Liu, xen-devel

On Tue, 2012-04-10 at 15:26 +0100, Ian Campbell wrote:
> As of the previous patch we align the end (rather than the start) of the struct
> to a cache line and so, with 32 and 64 byte cache lines and the shinfo size
> increase from the next patch, the first 8 bytes of the struct end up on a
> different cache line to the rest of it so make sure it is something relatively
> unimportant to avoid hitting an extra cache line on hot operations such as
> kfree_skb.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/skbuff.h |   15 ++++++++++-----
>  net/core/skbuff.c      |    5 ++++-
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0ad6a46..f0ae39c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -265,6 +265,15 @@ struct ubuf_info {
>   * the end of the header data, ie. at skb->end.
>   */
>  struct skb_shared_info {
> +	/* Intermediate layers must ensure that destructor_arg
> +	 * remains valid until skb destructor */
> +	void		*destructor_arg;
> +
> +	/*
> +	 * Warning: all fields from here until dataref are cleared in
> +	 * __alloc_skb()
> +	 *
> +	 */
>  	unsigned char	nr_frags;
>  	__u8		tx_flags;
>  	unsigned short	gso_size;
> @@ -276,14 +285,10 @@ struct skb_shared_info {
>  	__be32          ip6_frag_id;
>  
>  	/*
> -	 * Warning : all fields before dataref are cleared in __alloc_skb()
> +	 * Warning: all fields before dataref are cleared in __alloc_skb()
>  	 */
>  	atomic_t	dataref;
>  
> -	/* Intermediate layers must ensure that destructor_arg
> -	 * remains valid until skb destructor */
> -	void *		destructor_arg;
> -
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];
>  };
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index d4e139e..b8a41d6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -214,7 +214,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  
>  	/* make sure we initialize shinfo sequentially */
>  	shinfo = skb_shinfo(skb);
> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +
> +	memset(&shinfo->nr_frags, 0,
> +	       offsetof(struct skb_shared_info, dataref)
> +	       - offsetof(struct skb_shared_info, nr_frags));
>  	atomic_set(&shinfo->dataref, 1);
>  	kmemcheck_annotate_variable(shinfo->destructor_arg);
>  

Not sure if we can do the same in build_skb()

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 15:05   ` Eric Dumazet
  2012-04-10 15:19     ` Ian Campbell
@ 2012-04-10 15:19     ` Ian Campbell
  1 sibling, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 15:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David Miller, Michael S. Tsirkin, Wei Liu (Intern), xen-devel

On Tue, 2012-04-10 at 16:05 +0100, Eric Dumazet wrote:
> On Tue, 2012-04-10 at 15:26 +0100, Ian Campbell wrote:
> > As of the previous patch we align the end (rather than the start) of the struct
> > to a cache line and so, with 32 and 64 byte cache lines and the shinfo size
> > increase from the next patch, the first 8 bytes of the struct end up on a
> > different cache line to the rest of it so make sure it is something relatively
> > unimportant to avoid hitting an extra cache line on hot operations such as
> > kfree_skb.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > ---
> >  include/linux/skbuff.h |   15 ++++++++++-----
> >  net/core/skbuff.c      |    5 ++++-
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 0ad6a46..f0ae39c 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -265,6 +265,15 @@ struct ubuf_info {
> >   * the end of the header data, ie. at skb->end.
> >   */
> >  struct skb_shared_info {
> > +	/* Intermediate layers must ensure that destructor_arg
> > +	 * remains valid until skb destructor */
> > +	void		*destructor_arg;
> > +
> > +	/*
> > +	 * Warning: all fields from here until dataref are cleared in
> > +	 * __alloc_skb()
> > +	 *
> > +	 */
> >  	unsigned char	nr_frags;
> >  	__u8		tx_flags;
> >  	unsigned short	gso_size;
> > @@ -276,14 +285,10 @@ struct skb_shared_info {
> >  	__be32          ip6_frag_id;
> >  
> >  	/*
> > -	 * Warning : all fields before dataref are cleared in __alloc_skb()
> > +	 * Warning: all fields before dataref are cleared in __alloc_skb()
> >  	 */
> >  	atomic_t	dataref;
> >  
> > -	/* Intermediate layers must ensure that destructor_arg
> > -	 * remains valid until skb destructor */
> > -	void *		destructor_arg;
> > -
> >  	/* must be last field, see pskb_expand_head() */
> >  	skb_frag_t	frags[MAX_SKB_FRAGS];
> >  };
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index d4e139e..b8a41d6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -214,7 +214,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  
> >  	/* make sure we initialize shinfo sequentially */
> >  	shinfo = skb_shinfo(skb);
> > -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > +
> > +	memset(&shinfo->nr_frags, 0,
> > +	       offsetof(struct skb_shared_info, dataref)
> > +	       - offsetof(struct skb_shared_info, nr_frags));
> >  	atomic_set(&shinfo->dataref, 1);
> >  	kmemcheck_annotate_variable(shinfo->destructor_arg);
> >  
> 
> Not sure if we can do the same in build_skb()

I don't think there's any chance of there being a destructor_arg to
preserve in that case?

Regardless of that, though, I think for consistency it would be worth
pulling the common shinfo init out into a helper and using it in both
places.

I'll make that change.

Ian.

> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread
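
A sketch of the common helper proposed above (hypothetical name and
placement; the memset bounds are exactly those of the hunk under
discussion):

	/* Clear every shinfo field from nr_frags up to, but not including,
	 * dataref, leaving the leading destructor_arg untouched, then set
	 * the initial dataref.  Both __alloc_skb() and build_skb() would
	 * call this rather than open-coding the memset. */
	static inline void skb_shinfo_init(struct skb_shared_info *shinfo)
	{
		memset(&shinfo->nr_frags, 0,
		       offsetof(struct skb_shared_info, dataref) -
		       offsetof(struct skb_shared_info, nr_frags));
		atomic_set(&shinfo->dataref, 1);
		kmemcheck_annotate_variable(shinfo->destructor_arg);
	}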

* Re: [PATCH v4 0/10] skb paged fragment destructors
  2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
                   ` (23 preceding siblings ...)
  2012-04-10 15:46 ` Bart Van Assche
@ 2012-04-10 15:46 ` Bart Van Assche
  2012-04-10 15:50   ` Ian Campbell
  2012-04-10 15:50   ` Ian Campbell
  24 siblings, 2 replies; 71+ messages in thread
From: Bart Van Assche @ 2012-04-10 15:46 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	David VomLehn, xen-devel

On 04/10/12 14:26, Ian Campbell wrote:

> I think this is v4, but I've sort of lost count, sorry that it's taken
> me so long to get back to this stuff.
> 
> The following series makes use of the skb fragment API (which is in 3.2
> +) to add a per-paged-fragment destructor callback. This can be used by
> creators of skbs who are interested in the lifecycle of the pages
> included in that skb after they have handed it off to the network stack.


Hello Ian,

Great to see v4 of this patch series. But which kernel version is this
patch series based on? I've tried to apply the series on 3.4-rc2, but
applying patch 09/10 failed:

patching file net/ceph/messenger.c
Hunk #1 FAILED at 851.
1 out of 1 hunk FAILED -- saving rejects to file net/ceph/messenger.c.rej

Regards,

Bart.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 0/10] skb paged fragment destructors
  2012-04-10 15:46 ` Bart Van Assche
  2012-04-10 15:50   ` Ian Campbell
@ 2012-04-10 15:50   ` Ian Campbell
  2012-04-11 10:02     ` Bart Van Assche
  2012-04-11 10:02     ` Bart Van Assche
  1 sibling, 2 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-10 15:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin,
	Wei Liu (Intern),
	David VomLehn, xen-devel

On Tue, 2012-04-10 at 16:46 +0100, Bart Van Assche wrote:
> On 04/10/12 14:26, Ian Campbell wrote:
> 
> > I think this is v4, but I've sort of lost count, sorry that it's taken
> > me so long to get back to this stuff.
> > 
> > The following series makes use of the skb fragment API (which is in 3.2
> > +) to add a per-paged-fragment destructor callback. This can be used by
> > creators of skbs who are interested in the lifecycle of the pages
> > included in that skb after they have handed it off to the network stack.
> 
> 
> Hello Ian,
> 
> Great to see v4 of this patch series. But which kernel version has this
> patch series been based on ? I've tried to apply this series on 3.4-rc2

It's based on net-next/master. Specifically commit de8856d2c11f.

Ian.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 14:26 ` Ian Campbell
  2012-04-10 15:05   ` Eric Dumazet
  2012-04-10 15:05   ` Eric Dumazet
@ 2012-04-10 18:33   ` Alexander Duyck
  2012-04-10 18:41     ` Eric Dumazet
                       ` (3 more replies)
  2012-04-10 18:33   ` Alexander Duyck
  3 siblings, 4 replies; 71+ messages in thread
From: Alexander Duyck @ 2012-04-10 18:33 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel

On 04/10/2012 07:26 AM, Ian Campbell wrote:
> As of the previous patch we align the end (rather than the start) of the struct
> to a cache line and so, with 32 and 64 byte cache lines and the shinfo size
> increase from the next patch, the first 8 bytes of the struct end up on a
> different cache line to the rest of it so make sure it is something relatively
> unimportant to avoid hitting an extra cache line on hot operations such as
> kfree_skb.
>
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/skbuff.h |   15 ++++++++++-----
>  net/core/skbuff.c      |    5 ++++-
>  2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0ad6a46..f0ae39c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -265,6 +265,15 @@ struct ubuf_info {
>   * the end of the header data, ie. at skb->end.
>   */
>  struct skb_shared_info {
> +	/* Intermediate layers must ensure that destructor_arg
> +	 * remains valid until skb destructor */
> +	void		*destructor_arg;
> +
> +	/*
> +	 * Warning: all fields from here until dataref are cleared in
> +	 * __alloc_skb()
> +	 *
> +	 */
>  	unsigned char	nr_frags;
>  	__u8		tx_flags;
>  	unsigned short	gso_size;
> @@ -276,14 +285,10 @@ struct skb_shared_info {
>  	__be32          ip6_frag_id;
>  
>  	/*
> -	 * Warning : all fields before dataref are cleared in __alloc_skb()
> +	 * Warning: all fields before dataref are cleared in __alloc_skb()
>  	 */
>  	atomic_t	dataref;
>  
> -	/* Intermediate layers must ensure that destructor_arg
> -	 * remains valid until skb destructor */
> -	void *		destructor_arg;
> -
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];
>  };
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index d4e139e..b8a41d6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -214,7 +214,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  
>  	/* make sure we initialize shinfo sequentially */
>  	shinfo = skb_shinfo(skb);
> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +
> +	memset(&shinfo->nr_frags, 0,
> +	       offsetof(struct skb_shared_info, dataref)
> +	       - offsetof(struct skb_shared_info, nr_frags));
>  	atomic_set(&shinfo->dataref, 1);
>  	kmemcheck_annotate_variable(shinfo->destructor_arg);
>  

Have you checked this for 32 bit as well as 64?  Based on my math your
next patch will still mess up the memset on 32 bit, with the structure
being split somewhere just in front of hwtstamps.

Why not just take frags and move it to the start of the structure?  Its
size is already variable, since MAX_SKB_FRAGS can be either 16 or 17
depending on the value of PAGE_SIZE, and since you are aligning the end
of the structure, moving frags wouldn't affect the alignment of the
values after it.  That way you would be guaranteed that all of the
fields to be memset are in the last 64 bytes.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread
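
One way to audit this concern per architecture is to dump the offsets a
given configuration actually produces, either with pahole on a built
vmlinux or with a quick probe along these lines (a sketch; the field
names are those of the patched structure):

	/* Print where the memset region and the first frag land for the
	 * current architecture/PAGE_SIZE combination. */
	pr_info("shinfo: nr_frags@%zu dataref@%zu frags[0]@%zu..%zu total=%zu\n",
		offsetof(struct skb_shared_info, nr_frags),
		offsetof(struct skb_shared_info, dataref),
		offsetof(struct skb_shared_info, frags[0]),
		offsetof(struct skb_shared_info, frags[1]),
		sizeof(struct skb_shared_info));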

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 18:33   ` Alexander Duyck
  2012-04-10 18:41     ` Eric Dumazet
@ 2012-04-10 18:41     ` Eric Dumazet
  2012-04-10 19:15       ` Alexander Duyck
  2012-04-10 19:15       ` Alexander Duyck
  2012-04-11  7:56     ` Ian Campbell
  2012-04-11  7:56     ` Ian Campbell
  3 siblings, 2 replies; 71+ messages in thread
From: Eric Dumazet @ 2012-04-10 18:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Ian Campbell, netdev, David Miller, Michael S. Tsirkin, Wei Liu,
	xen-devel

On Tue, 2012-04-10 at 11:33 -0700, Alexander Duyck wrote:

> Have you checked this for 32 bit as well as 64?  Based on my math your
> next patch will still mess up the memset on 32 bit with the structure
> being split somewhere just in front of hwtstamps.
> 
> Why not just take frags and move it to the start of the structure?  It
> is already an unknown value because it can be either 16 or 17 depending
> on the value of PAGE_SIZE, and since you are making changes to frags the
> changes wouldn't impact the alignment of the other values later on since
> you are aligning the end of the structure.  That way you would be
> guaranteed that all of the fields that will be memset would be in the
> last 64 bytes.
> 

Now when a fragmented packet is copied in pskb_expand_head(), you access
two separate zones of memory to copy the shinfo. But that's supposed to
be a slow path.

The problem with this is that the offsets of often-used fields will be
large (instead of being < 127, i.e. a one-byte displacement) and the
code will be bigger on x86.

^ permalink raw reply	[flat|nested] 71+ messages in thread
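
To spell out the encoding point (illustrative offsets, assuming the
post-patch layout with destructor_arg first):

	/* A hot field at a small offset encodes with a one-byte
	 * displacement:
	 *
	 *	movzbl 0x8(%rdi),%eax        ; 4 bytes
	 *
	 * whereas an offset beyond 127 needs a four-byte displacement:
	 *
	 *	movzbl 0x88(%rdi),%eax       ; 7 bytes
	 *
	 * so pushing frequently used members deep into the struct grows
	 * every access site by a few bytes. */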

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 18:41     ` Eric Dumazet
@ 2012-04-10 19:15       ` Alexander Duyck
  2012-04-11  8:00         ` Ian Campbell
                           ` (3 more replies)
  2012-04-10 19:15       ` Alexander Duyck
  1 sibling, 4 replies; 71+ messages in thread
From: Alexander Duyck @ 2012-04-10 19:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ian Campbell, netdev, David Miller, Michael S. Tsirkin, Wei Liu,
	xen-devel

On 04/10/2012 11:41 AM, Eric Dumazet wrote:
> On Tue, 2012-04-10 at 11:33 -0700, Alexander Duyck wrote:
>
>> Have you checked this for 32 bit as well as 64?  Based on my math your
>> next patch will still mess up the memset on 32 bit with the structure
>> being split somewhere just in front of hwtstamps.
>>
>> Why not just take frags and move it to the start of the structure?  It
>> is already an unknown value because it can be either 16 or 17 depending
>> on the value of PAGE_SIZE, and since you are making changes to frags the
>> changes wouldn't impact the alignment of the other values later on since
>> you are aligning the end of the structure.  That way you would be
>> guaranteed that all of the fields that will be memset would be in the
>> last 64 bytes.
>>
> Now when a fragmented packet is copied in pskb_expand_head(), you access
> two separate zones of memory to copy the shinfo. But its supposed to be
> slow path.
>
> Problem with this is that the offsets of often used fields will be big
> (instead of being < 127) and code will be bigger on x86.

Actually, now that I think about it, my concerns go much further than
the memset.  I'm convinced that this is going to cause a pretty
significant performance regression in multiple drivers, especially on
non-x86_64 architectures.  What we have right now on most platforms is a
skb_shared_info structure in which everything up to and including frag 0
is in one cache line.  This gives us pretty good performance for igb and
ixgbe, since our common case when jumbo frames are not enabled is to
split the header and place the data in a page.

However, the change being recommended here only resolves the issue for
one specific architecture, and that is what I don't agree with.  What we
need is a solution that also works for 64K pages or 32-bit pointers, and
I am fairly certain the current solution does not.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 07/10] net: only allow paged fragments with the same destructor to be coalesced.
  2012-04-10 14:26 ` Ian Campbell
  2012-04-10 20:11   ` Ben Hutchings
@ 2012-04-10 20:11   ` Ben Hutchings
  2012-04-11  7:45     ` Ian Campbell
  2012-04-11  7:45     ` Ian Campbell
  1 sibling, 2 replies; 71+ messages in thread
From: Ben Hutchings @ 2012-04-10 20:11 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin, Wei Liu,
	xen-devel, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Michał Mirosław

Shouldn't this be folded into the previous change 'net: add support for
per-paged-fragment destructors'?  Maybe it doesn't matter since nothing
is setting a non-NULL fragment destructor yet.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 07/10] net: only allow paged fragments with the same destructor to be coalesced.
  2012-04-10 20:11   ` Ben Hutchings
@ 2012-04-11  7:45     ` Ian Campbell
  0 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-11  7:45 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin,
	Wei Liu (Intern),
	xen-devel, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Michał Mirosław

On Tue, 2012-04-10 at 21:11 +0100, Ben Hutchings wrote:
> Shouldn't this be folded into the previous change 'net: add support for
> per-paged-fragment destructors'?  Maybe it doesn't matter since nothing
> is setting a non-NULL fragment destructor yet.

I keep following exactly the same thought pattern and then ending up
leaving it due to indecision. I'll squash it next time unless anyone
thinks it is worth keeping split out.

Ian.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 18:33   ` Alexander Duyck
  2012-04-10 18:41     ` Eric Dumazet
@ 2012-04-11  7:56     ` Ian Campbell
  1 sibling, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-11  7:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin,
	Wei Liu (Intern),
	xen-devel

On Tue, 2012-04-10 at 19:33 +0100, Alexander Duyck wrote:
> On 04/10/2012 07:26 AM, Ian Campbell wrote:
> > As of the previous patch we align the end (rather than the start) of the struct
> > to a cache line and so, with 32 and 64 byte cache lines and the shinfo size
> > increase from the next patch, the first 8 bytes of the struct end up on a
> > different cache line to the rest of it so make sure it is something relatively
> > unimportant to avoid hitting an extra cache line on hot operations such as
> > kfree_skb.
> >
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > ---
> >  include/linux/skbuff.h |   15 ++++++++++-----
> >  net/core/skbuff.c      |    5 ++++-
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 0ad6a46..f0ae39c 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -265,6 +265,15 @@ struct ubuf_info {
> >   * the end of the header data, ie. at skb->end.
> >   */
> >  struct skb_shared_info {
> > +	/* Intermediate layers must ensure that destructor_arg
> > +	 * remains valid until skb destructor */
> > +	void		*destructor_arg;
> > +
> > +	/*
> > +	 * Warning: all fields from here until dataref are cleared in
> > +	 * __alloc_skb()
> > +	 *
> > +	 */
> >  	unsigned char	nr_frags;
> >  	__u8		tx_flags;
> >  	unsigned short	gso_size;
> > @@ -276,14 +285,10 @@ struct skb_shared_info {
> >  	__be32          ip6_frag_id;
> >  
> >  	/*
> > -	 * Warning : all fields before dataref are cleared in __alloc_skb()
> > +	 * Warning: all fields before dataref are cleared in __alloc_skb()
> >  	 */
> >  	atomic_t	dataref;
> >  
> > -	/* Intermediate layers must ensure that destructor_arg
> > -	 * remains valid until skb destructor */
> > -	void *		destructor_arg;
> > -
> >  	/* must be last field, see pskb_expand_head() */
> >  	skb_frag_t	frags[MAX_SKB_FRAGS];
> >  };
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index d4e139e..b8a41d6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -214,7 +214,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  
> >  	/* make sure we initialize shinfo sequentially */
> >  	shinfo = skb_shinfo(skb);
> > -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > +
> > +	memset(&shinfo->nr_frags, 0,
> > +	       offsetof(struct skb_shared_info, dataref)
> > +	       - offsetof(struct skb_shared_info, nr_frags));
> >  	atomic_set(&shinfo->dataref, 1);
> >  	kmemcheck_annotate_variable(shinfo->destructor_arg);
> >  
> 
> Have you checked this for 32 bit as well as 64?  Based on my math your
> next patch will still mess up the memset on 32 bit with the structure
> being split somewhere just in front of hwtstamps.

You mean 32 byte cache lines? If so then yes, there is a split halfway
through the structure in that case, but there's no way all this data
could ever fit in a single 32 byte cache line. Not including the frags
or destructor_arg, the region from nr_frags up to and including dataref
is 36 bytes on 32 bit and 40 bytes on 64 bit. I've not changed anything
in this respect.

If you meant 64 byte cache lines with 32 bit structure sizes, then by my
calculations everything from destructor_arg (in fact a bit earlier, from
12 bytes before that) up to and including frag[0] is in the same 64 byte
cache line.

I find the easiest way to check is to use gdb and open-code an offsetof
macro.

(gdb) print/d sizeof(struct skb_shared_info) - (unsigned long)&(((struct skb_shared_info *)0)->nr_frags)
$3 = 240
(gdb) print/d sizeof(struct skb_shared_info) - (unsigned long)&(((struct skb_shared_info *)0)->frags[1])
$4 = 192

So given 64 byte cache lines, the interesting area starts at 240/64=3.75
cache lines from the (aligned) end and finishes just before 192/64=3
cache lines from the end, so nr_frags through to frags[0] are all on the
same cache line.

Ian.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 19:15       ` Alexander Duyck
@ 2012-04-11  8:00         ` Ian Campbell
  2012-04-11 16:31           ` Alexander Duyck
  2012-04-11  8:20         ` Eric Dumazet
  1 sibling, 1 reply; 71+ messages in thread
From: Ian Campbell @ 2012-04-11  8:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, netdev, David Miller, Michael S. Tsirkin,
	Wei Liu (Intern),
	xen-devel

On Tue, 2012-04-10 at 20:15 +0100, Alexander Duyck wrote:
> On 04/10/2012 11:41 AM, Eric Dumazet wrote:
> > On Tue, 2012-04-10 at 11:33 -0700, Alexander Duyck wrote:
> >
> >> Have you checked this for 32 bit as well as 64?  Based on my math your
> >> next patch will still mess up the memset on 32 bit with the structure
> >> being split somewhere just in front of hwtstamps.
> >>
> >> Why not just take frags and move it to the start of the structure?  It
> >> is already an unknown value because it can be either 16 or 17 depending
> >> on the value of PAGE_SIZE, and since you are making changes to frags the
> >> changes wouldn't impact the alignment of the other values later on since
> >> you are aligning the end of the structure.  That way you would be
> >> guaranteed that all of the fields that will be memset would be in the
> >> last 64 bytes.
> >>
> > Now when a fragmented packet is copied in pskb_expand_head(), you access
> > two separate zones of memory to copy the shinfo. But it's supposed to be
> > a slow path.
> >
> > Problem with this is that the offsets of often used fields will be big
> > (instead of being < 127) and code will be bigger on x86.
> 
> Actually now that I think about it my concerns go much further than the
> memset.  I'm convinced that this is going to cause a pretty significant
> performance regression on multiple drivers, especially on non-x86_64
> architectures.  What we have right now on most platforms is a
> skb_shared_info structure in which everything up to and including frag 0
> is all in one cache line.  This gives us pretty good performance for igb
> and ixgbe, since our common case when jumbo frames are not enabled is to
> split the head and place the data in a page.

With all the changes in this series it is still possible to fit a
maximum standard MTU frame and the shinfo on the same 4K page while also
having the skb_shared_info up to and including frag[0] aligned to the
same 64 byte cache line.

The only exception is destructor_arg on 64 bit, which is on the
preceding cache line, but that is not a field used in any hot path.

> However the change being recommended here only resolves the issue for
> one specific architecture, and that is what I don't agree with.  What we
> need is a solution that also works for 64K pages or 32 bit pointers, and
> I am fairly certain this current solution does not.

I think it does work for 32 bit pointers. What issue do you see with
64K pages?

Ian.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-10 19:15       ` Alexander Duyck
  2012-04-11  8:00         ` Ian Campbell
@ 2012-04-11  8:20         ` Eric Dumazet
  2012-04-11 16:05           ` Alexander Duyck
  1 sibling, 1 reply; 71+ messages in thread
From: Eric Dumazet @ 2012-04-11  8:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Ian Campbell, netdev, David Miller, Michael S. Tsirkin, Wei Liu,
	xen-devel

On Tue, 2012-04-10 at 12:15 -0700, Alexander Duyck wrote:

> 
> Actually now that I think about it my concerns go much further than the
> memset.  I'm convinced that this is going to cause a pretty significant
> performance regression on multiple drivers, especially on non-x86_64
> architectures.  What we have right now on most platforms is a
> skb_shared_info structure in which everything up to and including frag 0
> is all in one cache line.  This gives us pretty good performance for igb
> and ixgbe, since our common case when jumbo frames are not enabled is to
> split the head and place the data in a page.

I don't understand this split thing for MTU=1500 frames.

Even using half a page per fragment, each skb:

needs 2 allocations for sk_buff and skb->head, plus one page alloc /
reference.

skb->truesize = ksize(skb->head) + sizeof(*skb) + PAGE_SIZE/2 = 512 +
256 + 2048 = 2816 bytes


With non-split you have:

2 allocations for sk_buff and skb->head.

skb->truesize = ksize(skb->head) + sizeof(*skb) = 2048 + 256 = 2304
bytes

less overhead and fewer calls to the page allocator...

This can only benefit if GRO is on, since aggregation can use fragments
and a single sk_buff, instead of a frag_list.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 0/10] skb paged fragment destructors
  2012-04-10 15:50   ` Ian Campbell
@ 2012-04-11 10:02     ` Bart Van Assche
  0 siblings, 0 replies; 71+ messages in thread
From: Bart Van Assche @ 2012-04-11 10:02 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David Miller, Eric Dumazet, Michael S. Tsirkin,
	Wei Liu (Intern),
	David VomLehn, xen-devel

On 04/10/12 15:50, Ian Campbell wrote:

> On Tue, 2012-04-10 at 16:46 +0100, Bart Van Assche wrote:
>> Great to see v4 of this patch series. But which kernel version has this
>> patch series been based on ? I've tried to apply this series on 3.4-rc2
> 
> It's based on net-next/master. Specifically commit de8856d2c11f.


Thanks, that information allowed me to apply the patch series and to
test it with kernel 3.4-rc2 and iSCSI target code. The test ran fine.

The failure to apply this patch series on 3.4-rc2 that I had reported
turned out to be an easy-to-resolve merge conflict:

+ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
+ 		     int offset, size_t size, int more)
+ {
+ 	int flags = MSG_DONTWAIT | MSG_NOSIGNAL | (more ? MSG_MORE : MSG_EOR);
+ 	int ret;
+ 
 -	ret = kernel_sendpage(sock, page, offset, size, flags);
++	ret = kernel_sendpage(sock, page, NULL, offset, size, flags);
+ 	if (ret == -EAGAIN)
+ 		ret = 0;
+ 
+ 	return ret;
+ }
+ 

Bart.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-11  8:20         ` Eric Dumazet
@ 2012-04-11 16:05           ` Alexander Duyck
  0 siblings, 0 replies; 71+ messages in thread
From: Alexander Duyck @ 2012-04-11 16:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ian Campbell, netdev, David Miller, Michael S. Tsirkin, Wei Liu,
	xen-devel

On 04/11/2012 01:20 AM, Eric Dumazet wrote:
> On Tue, 2012-04-10 at 12:15 -0700, Alexander Duyck wrote:
>
>> Actually now that I think about it my concerns go much further than the
>> memset.  I'm convinced that this is going to cause a pretty significant
>> performance regression on multiple drivers, especially on non x86_64
>> architecture.  What we have right now on most platforms is a
>> skb_shared_info structure in which everything up to and including frag 0
>> is all in one cache line.  This gives us pretty good performance for igb
>> and ixgbe since that is our common case when jumbo frames are not
>> enabled is to split the head and place the data in a page.
> I don't understand this split thing for MTU=1500 frames.
>
> Even using half a page per fragment, each skb:
>
> needs 2 allocations for sk_buff and skb->head, plus one page alloc /
> reference.
>
> skb->truesize = ksize(skb->head) + sizeof(*skb) + PAGE_SIZE/2 = 512 +
> 256 + 2048 = 2816 bytes
The number you provide for head is currently only available for 128 byte
skb allocations.  Anything larger than that will generate a 1K
allocation.  Also, after all of these patches, the smallest size you can
allocate will be 1K for anything under 504 bytes.

The size advantage is actually more for smaller frames.  In the case of
igb the behaviour is to place anything less than 512 bytes into just the
header and to skip using the page.  As such we get a much more ideal
allocation for small packets, since the truesize is only 1280 in that
case.

In the case of ixgbe the advantage is more about cache misses.
Ixgbe only receives the data into pages now.  I can prefetch the first
two cache lines of the page into memory while allocating the skb to
receive it.  As such we essentially cut the number of cache misses in
half versus the old approach, which had us generating cache misses on
the sk_buff during allocation, and then generating more cache misses
again once we received the buffer and could fill out the sk_buff
fields.  A similar size advantage exists as well, but only for frames
256 bytes or smaller.
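
As a sketch of that rx pattern (illustrative only, not the actual ixgbe
code; rx_page, rx_offset, headroom and netdev are placeholders):

	char *va = page_address(rx_page) + rx_offset;

	/* warm the first two cache lines of the payload while the
	 * sk_buff allocation takes its own misses */
	prefetch(va);
	prefetch(va + L1_CACHE_BYTES);

	skb = netdev_alloc_skb_ip_align(netdev, headroom);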

> With non-split you have:
>
> 2 allocations for sk_buff and skb->head.
>
> skb->truesize = ksize(skb->head) + sizeof(*skb) = 2048 + 256 = 2304
> bytes
>
> less overhead and fewer calls to the page allocator...
>
> This can only benefit if GRO is on, since aggregation can use fragments
> and a single sk_buff, instead of a frag_list.
There is much more than truesize involved here.  My main argument is
that if we are going to lay out this modified skb_shared_info so that
nr_frags is cache-aligned, we should do it on all architectures, not
just x86_64.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-11  8:00         ` Ian Campbell
@ 2012-04-11 16:31           ` Alexander Duyck
  2012-04-11 17:00             ` Ian Campbell
  0 siblings, 1 reply; 71+ messages in thread
From: Alexander Duyck @ 2012-04-11 16:31 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Eric Dumazet, netdev, David Miller, Michael S. Tsirkin,
	Wei Liu (Intern),
	xen-devel

On 04/11/2012 01:00 AM, Ian Campbell wrote:
> On Tue, 2012-04-10 at 20:15 +0100, Alexander Duyck wrote:
>> On 04/10/2012 11:41 AM, Eric Dumazet wrote:
>>> On Tue, 2012-04-10 at 11:33 -0700, Alexander Duyck wrote:
>>>
>>>> Have you checked this for 32 bit as well as 64?  Based on my math your
>>>> next patch will still mess up the memset on 32 bit with the structure
>>>> being split somewhere just in front of hwtstamps.
>>>>
>>>> Why not just take frags and move it to the start of the structure?  It
>>>> is already an unknown value because it can be either 16 or 17 depending
>>>> on the value of PAGE_SIZE, and since you are making changes to frags the
>>>> changes wouldn't impact the alignment of the other values later on since
>>>> you are aligning the end of the structure.  That way you would be
>>>> guaranteed that all of the fields that will be memset would be in the
>>>> last 64 bytes.
>>>>
>>> Now when a fragmented packet is copied in pskb_expand_head(), you access
>>> two separate zones of memory to copy the shinfo. But it's supposed to be
>>> a slow path.
>>>
>>> Problem with this is that the offsets of often used fields will be big
>>> (instead of being < 127) and code will be bigger on x86.
>> Actually now that I think about it my concerns go much further than the
>> memset.  I'm convinced that this is going to cause a pretty significant
>> performance regression on multiple drivers, especially on non-x86_64
>> architectures.  What we have right now on most platforms is a
>> skb_shared_info structure in which everything up to and including frag 0
>> is all in one cache line.  This gives us pretty good performance for igb
>> and ixgbe, since our common case when jumbo frames are not enabled is to
>> split the head and place the data in a page.
> With all the changes in this series it is still possible to fit a
> maximum standard MTU frame and the shinfo on the same 4K page while also
> having the skb_shared_info up to and including frag[0] aligned to the
> same 64 byte cache line.
>
> The only exception is destructor_arg on 64 bit, which is on the
> preceding cache line, but that is not a field used in any hot path.
The problem I have is that this is only true on x86_64.  Proper work
hasn't been done to guarantee this on any other architectures.

What I would like to see is this: instead of just setting things up and
hoping it comes out cache-aligned on nr_frags, why not take steps to
guarantee it?  You could do something like place and size the structure
based on:

SKB_DATA_ALIGN(sizeof(skb_shared_info) -
	       offsetof(struct skb_shared_info, nr_frags)) +
	offsetof(struct skb_shared_info, nr_frags)

That way you would have your alignment still guaranteed based on the end
of the structure, but anything placed before nr_frags would be placed on
the end of the previous cache line.
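
Spelled out as a macro, the suggestion would look something like this (a
sketch only; the name SKB_SHINFO_ALLOCSIZE is made up here):

	/* Round the tail of the struct (nr_frags onwards) up to a
	 * whole number of cache lines, then add back the head.  If
	 * the allocation is placed so that it *ends* on a cache line
	 * boundary, nr_frags then always starts on one, whatever the
	 * architecture, PAGE_SIZE or pointer size. */
	#define SKB_SHINFO_ALLOCSIZE					\
		(SKB_DATA_ALIGN(sizeof(struct skb_shared_info) -	\
				offsetof(struct skb_shared_info, nr_frags)) + \
		 offsetof(struct skb_shared_info, nr_frags))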

>> However the change being recommended here only resolves the issue for
>> one specific architecture, and that is what I don't agree with.  What we
>> need is a solution that also works for 64K pages or 32 bit pointers, and
>> I am fairly certain this current solution does not.
> I think it does work for 32 bit pointers. What issue do you see with
> 64K pages?
>
> Ian.
With 64K pages the MAX_SKB_FRAGS value drops from 17 to 16.  That will
undoubtedly mess up the alignment.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 05/10] net: move destructor_arg to the front of sk_buff.
  2012-04-11 16:31           ` Alexander Duyck
@ 2012-04-11 17:00             ` Ian Campbell
  0 siblings, 0 replies; 71+ messages in thread
From: Ian Campbell @ 2012-04-11 17:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, netdev, David Miller, Michael S. Tsirkin,
	Wei Liu (Intern),
	xen-devel

On Wed, 2012-04-11 at 17:31 +0100, Alexander Duyck wrote:
> On 04/11/2012 01:00 AM, Ian Campbell wrote:
> > On Tue, 2012-04-10 at 20:15 +0100, Alexander Duyck wrote:
> >> On 04/10/2012 11:41 AM, Eric Dumazet wrote:
> >>> On Tue, 2012-04-10 at 11:33 -0700, Alexander Duyck wrote:
> >>>
> >>>> Have you checked this for 32 bit as well as 64?  Based on my math your
> >>>> next patch will still mess up the memset on 32 bit with the structure
> >>>> being split somewhere just in front of hwtstamps.
> >>>>
> >>>> Why not just take frags and move it to the start of the structure?  It
> >>>> is already an unknown value because it can be either 16 or 17 depending
> >>>> on the value of PAGE_SIZE, and since you are making changes to frags the
> >>>> changes wouldn't impact the alignment of the other values later on since
> >>>> you are aligning the end of the structure.  That way you would be
> >>>> guaranteed that all of the fields that will be memset would be in the
> >>>> last 64 bytes.
> >>>>
> >>> Now when a fragmented packet is copied in pskb_expand_head(), you access
> >>> two separate zones of memory to copy the shinfo. But it's supposed to be
> >>> a slow path.
> >>>
> >>> Problem with this is that the offsets of often used fields will be big
> >>> (instead of being < 127) and code will be bigger on x86.
> >> Actually now that I think about it my concerns go much further than the
> >> memset.  I'm convinced that this is going to cause a pretty significant
> >> performance regression on multiple drivers, especially on non-x86_64
> >> architectures.  What we have right now on most platforms is a
> >> skb_shared_info structure in which everything up to and including frag 0
> >> is all in one cache line.  This gives us pretty good performance for igb
> >> and ixgbe, since our common case when jumbo frames are not enabled is to
> >> split the head and place the data in a page.
> > With all the changes in this series it is still possible to fit a
> > maximum standard MTU frame and the shinfo on the same 4K page while also
> > having the skb_shared_info up to and including frag[0] aligned to the
> > same 64 byte cache line.
> >
> > The only exception is destructor_arg on 64 bit, which is on the
> > preceding cache line, but that is not a field used in any hot path.
> The problem I have is that this is only true on x86_64.  Proper work
> hasn't been done to guarantee this on any other architectures.

FWIW I did also explicitly cover i386 (see
<1334130984.12209.195.camel@dagon.hellion.org.uk>)

> What I would like to see is this: instead of just setting things up and
> hoping it comes out cache-aligned on nr_frags, why not take steps to
> guarantee it?  You could do something like place and size the structure
> based on:
>
> SKB_DATA_ALIGN(sizeof(skb_shared_info) -
> 	       offsetof(struct skb_shared_info, nr_frags)) +
> 	offsetof(struct skb_shared_info, nr_frags)
> 
> That way you would have your alignment still guaranteed based on the end
> of the structure, but anything placed before nr_frags would be placed on
> the end of the previous cache line.
> 
> >> However the change being recommended here only resolves the issue for
> >> one specific architecture, and that is what I don't agree with.  What we
> >> need is a solution that also works for 64K pages or 32 bit pointers, and
> >> I am fairly certain this current solution does not.
> > I think it does work for 32 bit pointers. What issue do you see with
> > 64K pages?
> >
> > Ian.
> With 64K pages the MAX_SKB_FRAGS value drops from 17 to 16.  That will
> undoubtedly mess up the alignment.

Oh, I see. I need to think about this some more, but your suggestion
above is an interesting one; I'll see what I can do with that.

Ian.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Xen-devel] [PATCH 06/10] net: add support for per-paged-fragment destructors
  2012-04-10 14:26 ` Ian Campbell
@ 2012-04-26 20:44   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 71+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-04-26 20:44 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, Wei Liu, Eric Dumazet, Michael S. Tsirkin, xen-devel,
	Michał Mirosław, David Miller

On Tue, Apr 10, 2012 at 03:26:20PM +0100, Ian Campbell wrote:
> Entities which care about the complete lifecycle of pages which they inject
> into the network stack via an skb paged fragment can choose to set this
> destructor in order to receive a callback when the stack is really finished
> with a page (including all clones, retransmits, pull-ups etc etc).
> 
> This destructor will always be propagated alongside the struct page when
> copying skb_frag_t->page. This is the reason I chose to embed the destructor in
> a "struct { } page" within the skb_frag_t, rather than as a separate field,
> since it allows existing code which propagates ->frags[N].page to Just
> Work(tm).
> 
> When the destructor is present the page reference counting is done slightly
> differently. No references are held by the network stack on the struct page (it
> is up to the caller to manage this as necessary) instead the network stack will
> track references via the count embedded in the destructor structure. When this
> reference count reaches zero then the destructor will be called and the caller
> can take the necesary steps to release the page (i.e. release the struct page
> reference itself).
> 
> The intention is that callers can use this callback to delay completion to
> _their_ callers until the network stack has completely released the page, in
> order to prevent use-after-free or modification of data pages which are still
> in use by the stack.
> 
> It is allowable (indeed expected) for a caller to share a single destructor
> instance between multiple pages injected into the stack e.g. a group of pages
> included in a single higher level operation might share a destructor which is
> used to complete that higher level operation.
> 
> With this change and the previous two changes to shinfo alignment and field
> orderring it is now the case tyhat on a 64 bit system with 64 byte cache lines,
                              ^^^^ - that.

> everything from nr_frags until the end of frags[0] is on the same cacheline.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
> Cc: netdev@vger.kernel.org
> ---
>  include/linux/skbuff.h |   43 +++++++++++++++++++++++++++++++++++++++++++
>  net/core/skbuff.c      |   17 +++++++++++++++++
>  2 files changed, 60 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f0ae39c..6ac283e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -166,9 +166,15 @@ struct sk_buff;
>  
>  typedef struct skb_frag_struct skb_frag_t;
>  
> +struct skb_frag_destructor {
> +	atomic_t ref;
> +	int (*destroy)(struct skb_frag_destructor *destructor);
> +};
> +
>  struct skb_frag_struct {
>  	struct {
>  		struct page *p;
> +		struct skb_frag_destructor *destructor;
>  	} page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
>  	__u32 page_offset;
> @@ -1221,6 +1227,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
>  }
>  
>  /**
> + * skb_frag_set_destructor - set destructor for a paged fragment
> + * @skb: buffer containing fragment to be initialised
> + * @i: paged fragment index to initialise
> + * @destroy: the destructor to use for this fragment
> + *
> + * Sets @destroy as the destructor to be called when all references to
> + * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
> + * etc) are released.
> + *
> + * When a destructor is set then reference counting is performed on
> + * @destroy->ref. When the ref reaches zero then @destroy->destroy
> + * will be called. The caller is responsible for holding and managing
> + * any other references (such a the struct page reference count).
> + *
> + * This function must be called before any use of skb_frag_ref() or
> + * skb_frag_unref().
> + */
> +static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
> +					   struct skb_frag_destructor *destroy)
> +{
> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +	frag->page.destructor = destroy;
> +}
> +
> +/**
>   * __skb_fill_page_desc - initialise a paged fragment in an skb
>   * @skb: buffer containing fragment to be initialised
>   * @i: paged fragment index to initialise
> @@ -1239,6 +1270,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
>  	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
>  
>  	frag->page.p		  = page;
> +	frag->page.destructor     = NULL;
>  	frag->page_offset	  = off;
>  	skb_frag_size_set(frag, size);
>  }
> @@ -1743,6 +1775,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>  	return frag->page.p;
>  }
>  
> +extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
> +extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
> +
>  /**
>   * __skb_frag_ref - take an addition reference on a paged fragment.
>   * @frag: the paged fragment
> @@ -1751,6 +1786,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_ref(skb_frag_t *frag)
>  {
> +	if (unlikely(frag->page.destructor)) {
> +		skb_frag_destructor_ref(frag->page.destructor);
> +		return;
> +	}
>  	get_page(skb_frag_page(frag));
>  }
>  
> @@ -1774,6 +1813,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
>   */
>  static inline void __skb_frag_unref(skb_frag_t *frag)
>  {
> +	if (unlikely(frag->page.destructor)) {
> +		skb_frag_destructor_unref(frag->page.destructor);
> +		return;
> +	}
>  	put_page(skb_frag_page(frag));
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b8a41d6..9ec88ce 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -349,6 +349,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
>  }
>  EXPORT_SYMBOL(dev_alloc_skb);
>  
> +void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
> +{
> +	BUG_ON(destroy == NULL);
> +	atomic_inc(&destroy->ref);
> +}
> +EXPORT_SYMBOL(skb_frag_destructor_ref);
> +
> +void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
> +{
> +	if (destroy == NULL)
> +		return;
> +
> +	if (atomic_dec_and_test(&destroy->ref))
> +		destroy->destroy(destroy);
> +}
> +EXPORT_SYMBOL(skb_frag_destructor_unref);
> +
>  static void skb_drop_list(struct sk_buff **listp)
>  {
>  	struct sk_buff *list = *listp;
> -- 
> 1.7.2.5
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
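
For what it's worth, a minimal caller-side sketch of the interface
quoted above (only the skb_frag_* calls and struct skb_frag_destructor
come from the patch; my_completion, my_frag_destroy and
complete_my_request are hypothetical names):

	struct my_completion {
		struct skb_frag_destructor destroy;
		/* ... caller state for the higher level operation ... */
	};

	static int my_frag_destroy(struct skb_frag_destructor *d)
	{
		struct my_completion *c =
			container_of(d, struct my_completion, destroy);

		/* the stack holds no more references to our pages, so
		 * it is now safe to complete the higher level op */
		complete_my_request(c);
		return 0;
	}

	/* when building the skb: fill the frag first (that resets the
	 * frag's destructor to NULL), then install ours */
	atomic_set(&c->destroy.ref, 1);
	c->destroy.destroy = my_frag_destroy;
	__skb_fill_page_desc(skb, 0, page, offset, size);
	skb_frag_set_destructor(skb, 0, &c->destroy);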

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/10] net: add support for per-paged-fragment destructors
  2012-04-10 14:26 ` Ian Campbell
  2012-04-26 20:44   ` [Xen-devel] " Konrad Rzeszutek Wilk
@ 2012-04-26 20:44   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 71+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-04-26 20:44 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Eric Dumazet, Michael S. Tsirkin, netdev,
	Michał Mirosław, xen-devel, David Miller

On Tue, Apr 10, 2012 at 03:26:20PM +0100, Ian Campbell wrote:
> Entities which care about the complete lifecycle of pages which they inject
> into the network stack via an skb paged fragment can choose to set this
> destructor in order to receive a callback when the stack is really finished
> with a page (including all clones, retransmits, pull-ups etc etc).
> 
> This destructor will always be propagated alongside the struct page when
> copying skb_frag_t->page. This is the reason I chose to embed the destructor in
> a "struct { } page" within the skb_frag_t, rather than as a separate field,
> since it allows existing code which propagates ->frags[N].page to Just
> Work(tm).
> 
> When the destructor is present the page reference counting is done slightly
> differently. No references are held by the network stack on the struct page (it
> is up to the caller to manage this as necessary) instead the network stack will
> track references via the count embedded in the destructor structure. When this
> reference count reaches zero then the destructor will be called and the caller
> can take the necesary steps to release the page (i.e. release the struct page
> reference itself).
> 
> The intention is that callers can use this callback to delay completion to
> _their_ callers until the network stack has completely released the page, in
> order to prevent use-after-free or modification of data pages which are still
> in use by the stack.
> 
> It is allowable (indeed expected) for a caller to share a single destructor
> instance between multiple pages injected into the stack e.g. a group of pages
> included in a single higher level operation might share a destructor which is
> used to complete that higher level operation.
> 
> With this change and the previous two changes to shinfo alignment and field
> orderring it is now the case tyhat on a 64 bit system with 64 byte cache lines,
                              ^^^^ - that.

> everything from nr_frags until the end of frags[0] is on the same cacheline.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
> Cc: netdev@vger.kernel.org
> ---
>  include/linux/skbuff.h |   43 +++++++++++++++++++++++++++++++++++++++++++
>  net/core/skbuff.c      |   17 +++++++++++++++++
>  2 files changed, 60 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f0ae39c..6ac283e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -166,9 +166,15 @@ struct sk_buff;
>  
>  typedef struct skb_frag_struct skb_frag_t;
>  
> +struct skb_frag_destructor {
> +	atomic_t ref;
> +	int (*destroy)(struct skb_frag_destructor *destructor);
> +};
> +
>  struct skb_frag_struct {
>  	struct {
>  		struct page *p;
> +		struct skb_frag_destructor *destructor;
>  	} page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
>  	__u32 page_offset;
> @@ -1221,6 +1227,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
>  }
>  
>  /**
> + * skb_frag_set_destructor - set destructor for a paged fragment
> + * @skb: buffer containing fragment to be initialised
> + * @i: paged fragment index to initialise
> + * @destroy: the destructor to use for this fragment
> + *
> + * Sets @destroy as the destructor to be called when all references to
> + * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
> + * etc) are released.
> + *
> + * When a destructor is set then reference counting is performed on
> + * @destroy->ref. When the ref reaches zero then @destroy->destroy
> + * will be called. The caller is responsible for holding and managing
> + * any other references (such as the struct page reference count).
> + *
> + * This function must be called before any use of skb_frag_ref() or
> + * skb_frag_unref().
> + */
> +static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
> +					   struct skb_frag_destructor *destroy)
> +{
> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +	frag->page.destructor = destroy;
> +}
> +
> +/**
>   * __skb_fill_page_desc - initialise a paged fragment in an skb
>   * @skb: buffer containing fragment to be initialised
>   * @i: paged fragment index to initialise
> @@ -1239,6 +1270,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
>  	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
>  
>  	frag->page.p		  = page;
> +	frag->page.destructor     = NULL;
>  	frag->page_offset	  = off;
>  	skb_frag_size_set(frag, size);
>  }
> @@ -1743,6 +1775,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>  	return frag->page.p;
>  }
>  
> +extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
> +extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
> +
>  /**
>   * __skb_frag_ref - take an additional reference on a paged fragment.
>   * @frag: the paged fragment
> @@ -1751,6 +1786,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_ref(skb_frag_t *frag)
>  {
> +	if (unlikely(frag->page.destructor)) {
> +		skb_frag_destructor_ref(frag->page.destructor);
> +		return;
> +	}
>  	get_page(skb_frag_page(frag));
>  }
>  
> @@ -1774,6 +1813,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
>   */
>  static inline void __skb_frag_unref(skb_frag_t *frag)
>  {
> +	if (unlikely(frag->page.destructor)) {
> +		skb_frag_destructor_unref(frag->page.destructor);
> +		return;
> +	}
>  	put_page(skb_frag_page(frag));
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b8a41d6..9ec88ce 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -349,6 +349,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
>  }
>  EXPORT_SYMBOL(dev_alloc_skb);
>  
> +void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
> +{
> +	BUG_ON(destroy == NULL);
> +	atomic_inc(&destroy->ref);
> +}
> +EXPORT_SYMBOL(skb_frag_destructor_ref);
> +
> +void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
> +{
> +	if (destroy == NULL)
> +		return;
> +
> +	if (atomic_dec_and_test(&destroy->ref))
> +		destroy->destroy(destroy);
> +}
> +EXPORT_SYMBOL(skb_frag_destructor_unref);
> +
>  static void skb_drop_list(struct sk_buff **listp)
>  {
>  	struct sk_buff *list = *listp;
> -- 
> 1.7.2.5
> 
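[Not part of the patch: once a destructor is attached, any path that
duplicates fragment descriptors goes through __skb_frag_ref() and so
bumps destroy->ref rather than the page count. An illustrative
lifecycle, assuming skb carries destructor-backed fragments and the
caller has already dropped its own reference:

	struct sk_buff *copy = pskb_copy(skb, GFP_ATOMIC);
					/* destroy->ref++ for each frag */

	kfree_skb(skb);		/* per-frag unref; copy still holds refs */
	kfree_skb(copy);	/* final unref: destroy->destroy() fires, and
				 * only now may the caller reclaim the pages */
]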

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread

Thread overview: 71+ messages
-- links below jump to the message on this page --
2012-04-10 14:26 [PATCH v4 0/10] skb paged fragment destructors Ian Campbell
2012-04-10 14:26 ` [PATCH 01/10] net: add and use SKB_ALLOCSIZE Ian Campbell
2012-04-10 14:57   ` Eric Dumazet
2012-04-10 14:57   ` Eric Dumazet
2012-04-10 14:26 ` Ian Campbell
2012-04-10 14:26 ` [PATCH 02/10] net: Use SKB_WITH_OVERHEAD in build_skb Ian Campbell
2012-04-10 14:58   ` Eric Dumazet
2012-04-10 14:58   ` Eric Dumazet
2012-04-10 14:26 ` Ian Campbell
2012-04-10 14:26 ` [PATCH 03/10] chelsio: use SKB_WITH_OVERHEAD Ian Campbell
2012-04-10 14:59   ` Eric Dumazet
2012-04-10 14:59   ` Eric Dumazet
2012-04-10 14:26 ` Ian Campbell
2012-04-10 14:26 ` [PATCH 04/10] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
2012-04-10 15:01   ` Eric Dumazet
2012-04-10 15:01   ` Eric Dumazet
2012-04-10 14:26 ` Ian Campbell
2012-04-10 14:26 ` [PATCH 05/10] net: move destructor_arg to the front of sk_buff Ian Campbell
2012-04-10 14:26 ` Ian Campbell
2012-04-10 15:05   ` Eric Dumazet
2012-04-10 15:19     ` Ian Campbell
2012-04-10 15:19     ` Ian Campbell
2012-04-10 15:05   ` Eric Dumazet
2012-04-10 18:33   ` Alexander Duyck
2012-04-10 18:41     ` Eric Dumazet
2012-04-10 18:41     ` Eric Dumazet
2012-04-10 19:15       ` Alexander Duyck
2012-04-11  8:00         ` Ian Campbell
2012-04-11  8:00         ` Ian Campbell
2012-04-11 16:31           ` Alexander Duyck
2012-04-11 17:00             ` Ian Campbell
2012-04-11 17:00             ` Ian Campbell
2012-04-11 16:31           ` Alexander Duyck
2012-04-11  8:20         ` Eric Dumazet
2012-04-11 16:05           ` Alexander Duyck
2012-04-11 16:05           ` Alexander Duyck
2012-04-11  8:20         ` Eric Dumazet
2012-04-10 19:15       ` Alexander Duyck
2012-04-11  7:56     ` Ian Campbell
2012-04-11  7:56     ` Ian Campbell
2012-04-10 18:33   ` Alexander Duyck
2012-04-10 14:26 ` [PATCH 06/10] net: add support for per-paged-fragment destructors Ian Campbell
2012-04-10 14:26 ` Ian Campbell
2012-04-26 20:44   ` [Xen-devel] " Konrad Rzeszutek Wilk
2012-04-26 20:44   ` Konrad Rzeszutek Wilk
2012-04-10 14:26 ` [PATCH 07/10] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
2012-04-10 14:26 ` Ian Campbell
2012-04-10 20:11   ` Ben Hutchings
2012-04-10 20:11   ` Ben Hutchings
2012-04-11  7:45     ` Ian Campbell
2012-04-11  7:45     ` Ian Campbell
2012-04-10 14:26 ` [PATCH 08/10] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell
2012-04-10 14:26 ` Ian Campbell
2012-04-10 14:26 ` [PATCH 09/10] net: add paged frag destructor support to kernel_sendpage Ian Campbell
2012-04-10 14:26 ` [PATCH 10/10] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell
     [not found] ` <1334067965.5394.22.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2012-04-10 14:26   ` [PATCH 09/10] net: add paged frag destructor support to kernel_sendpage Ian Campbell
2012-04-10 14:26     ` [Ocfs2-devel] " Ian Campbell
2012-04-10 14:26     ` Ian Campbell
2012-04-10 14:26   ` [PATCH 10/10] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell
2012-04-10 14:26     ` Ian Campbell
2012-04-10 14:58 ` [PATCH v4 0/10] skb paged fragment destructors Michael S. Tsirkin
2012-04-10 14:58 ` Michael S. Tsirkin
2012-04-10 15:00 ` Michael S. Tsirkin
2012-04-10 15:00 ` Michael S. Tsirkin
2012-04-10 15:46 ` Bart Van Assche
2012-04-10 15:46 ` Bart Van Assche
2012-04-10 15:50   ` Ian Campbell
2012-04-10 15:50   ` Ian Campbell
2012-04-11 10:02     ` Bart Van Assche
2012-04-11 10:02     ` Bart Van Assche
2012-04-10 14:26 Ian Campbell
