netdev.vger.kernel.org archive mirror
* [PATCH 0/6 v2] skb paged fragment destructors
@ 2012-01-05 17:09 Ian Campbell
  2012-01-05 17:13 ` [PATCH 1/6] net: pack skb_shared_info more efficiently Ian Campbell
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet

The following series makes use of the skb fragment API (which is in 3.2)
to add a per-paged-fragment destructor callback. This can be used by
creators of skbs who are interested in the lifecycle of the pages
included in that skb after they have handed it off to the network stack.
I think these have all been posted before, but have been backed up
behind the skb fragment API.

The mail at [0] contains some more background and rationale, but
basically the completed series will allow entities which inject pages
into the networking stack to receive a notification when the stack has
really finished with those pages (i.e. including retransmissions,
clones, pull-ups etc) and not just when the original skb is finished
with. This is beneficial to the many subsystems which wish to inject
pages into the network stack without giving up full ownership of those
pages' lifecycle. It implements something broadly along the lines of
what was described in [1].
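
For orientation, the heart of the series is the small per-fragment
destructor structure which patch 3 adds to include/linux/skbuff.h,
roughly:

struct skb_frag_destructor {
	atomic_t ref;			/* reference count tracked by the stack */
	int (*destroy)(void *data);	/* called when ref drops to zero */
	void *data;
};

The stack takes and drops references on this (via
skb_frag_ref()/skb_frag_unref()) rather than on the underlying struct
page, so destroy() only fires once every clone, retransmit, pull-up
etc has let go of the fragment.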

I have also included a patch to the RPC subsystem which uses this API to
fix the bug which I describe at [2].

Last time I posted this series it was observed that the size of struct
skb_frag_struct was increased sufficiently that a 1500 byte frame would
no longer fit into a half page allocation (with 4K pages).

I investigated some options which did not require increasing the size of
the skb_frag_struct at all but they were mostly pretty ugly (either for
the user of the API or within the network stack itself). 

However having observed that MAX_SKB_FRAGS could be reduced by 1 (see
9d4dde521577 "net: only use a single page of slop in MAX_SKB_FRAGS") I
decided it was worth trying to see if I could pack the shared info a bit
tighter and fit it into the necessary space.

By tweaking the ordering of the fields and reducing the size of nr_frags
(in combination with 9d4dde521577) I was able to get the shinfo size
down to:

                                          BEFORE        AFTER(v1)	AFTER(v2)
AMD64:  sizeof(struct skb_frag_struct)  = 16            24		24
        sizeof(struct skb_shared_info)  = 344           488		456

i386:   sizeof(struct skb_frag_struct)  = 8             12		12
        sizeof(struct skb_shared_info)  = 188           260		244

(I think these are representative of 32 and 64 bit arches generally)

This isn't quite enough to squeeze things into half a page, since the
data allocation and the shinfo are each cache line aligned
individually, e.g. for 64 byte cache lines on amd64:

          ALIGN(NET_SKB_PAD(64) + 1500 + 14) + ALIGN(456)
        = ALIGN(1578) + ALIGN(456)
        = 1600 + 512
        = 2112

This actually leaves a fair bit of slack in many cases, so rather than
aligning the front of the shinfo we align its end to a cache line by
using ksize() to place it right at the end of the actual allocation.

If instead we align the total allocation size we reduce the amount of
slop and a maximum MTU frame fits into half a page:

          ALIGN(NET_SKB_PAD(64) + 1500 + 14 + 456)
        = ALIGN(1578 + 456)
        = ALIGN(2034)
        = 2048
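
(For reference, here is a standalone userspace C sketch of the two
padding schemes, purely illustrative: it uses the sizes quoted above
and reproduces SKB_DATA_ALIGN with SMP_CACHE_BYTES assumed to be 64.)

#include <stdio.h>

#define SMP_CACHE_BYTES		64
#define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
				 ~(SMP_CACHE_BYTES - 1))

int main(void)
{
	int pad = 64, hdr = 14, mtu = 1500, shinfo = 456;

	/* current: pad the data area and the shinfo separately */
	int separate = SKB_DATA_ALIGN(pad + hdr + mtu) + SKB_DATA_ALIGN(shinfo);

	/* proposed: pad the allocation as a whole */
	int whole = SKB_DATA_ALIGN(pad + hdr + mtu + shinfo);

	printf("separate=%d whole=%d\n", separate, whole); /* 2112 vs 2048 */
	return 0;
}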

The downside in this scenario is that, for 64 byte cache lines, the
first 8 bytes of the shinfo are on the same cache line as the tail of
the data. I think this is the worst case: as the data size varies the
"overlap" will always be <= this, assuming the allocator always rounds
to a multiple of the cache line size. I think this small overlap is
better than spilling over into the next allocation size, and it only
happens for data sizes of 427-490 and 1451-1500 bytes (inclusive).

For the 128 byte cache line case the overlap at worst is 72 bytes which
is up to and including shinfo->frags[0]. This happens for sizes in the
ranges 363-490 and 1387-1500 bytes (inclusive).

There may be various ways in which we could mitigate this if it turns
out to be a problem, the most obvious being to reorder the shinfo to
put less frequently accessed members up front (e.g. destructor_arg
seems like a good candidate). An even more extreme idea might be to
put the shinfo _first_ within the allocation such that the overlap is
with the last (and presumably less frequently used) frags.

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/6] net: pack skb_shared_info more efficiently
  2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
@ 2012-01-05 17:13 ` Ian Campbell
  2012-01-05 18:30   ` Eric Dumazet
  2012-01-05 17:13 ` [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:13 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Ian Campbell

nr_frags can be 8 bits since 256 is plenty of fragments. This allows it to be
packed with tx_flags.

Also by moving ip6_frag_id and dataref (both 4 bytes) next to each other we can
avoid a hole between ip6_frag_id and frag_list on 64 bit systems.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f47f0c3..50db9b0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -242,15 +242,15 @@ struct ubuf_info {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
-	unsigned short	nr_frags;
+	unsigned char	nr_frags;
+	__u8		tx_flags;
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
 	unsigned short  gso_type;
-	__be32          ip6_frag_id;
-	__u8		tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
+	__be32          ip6_frag_id;
 
 	/*
 	 * Warning : all fields before dataref are cleared in __alloc_skb()
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
  2012-01-05 17:13 ` [PATCH 1/6] net: pack skb_shared_info more efficiently Ian Campbell
@ 2012-01-05 17:13 ` Ian Campbell
  2012-01-05 17:46   ` Eric Dumazet
  2012-01-05 17:13 ` [PATCH 3/6] net: add support for per-paged-fragment destructors Ian Campbell
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:13 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Ian Campbell

We already push the shinfo to the very end of the actual allocation which is
already cache-line aligned.

This reduces the minimum overhead required for this allocation such that the
shinfo can be grown in the following patch without overflowing 2048 bytes for a
1500 byte frame. Reducing this overhead means that sometimes the tail end of
the data can end up in the same cache line as the beginning of the shinfo but
in many cases the allocation slop means that there is no overlap.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |    4 ++--
 net/core/skbuff.c      |    3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 50db9b0..f04333b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,7 +41,7 @@
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
 #define SKB_WITH_OVERHEAD(X)	\
-	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+	((X) - sizeof(struct skb_shared_info))
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
@@ -50,7 +50,7 @@
 /* return minimum truesize of one skb containing X bytes of data */
 #define SKB_TRUESIZE(X) ((X) +						\
 			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
-			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+			 sizeof(struct skb_shared_info))
 
 /* A. Checksumming of received packets by device.
  *
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index da0c97f..b6de604 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
 	 * Both skb->head and skb_shared_info are cache line aligned.
 	 */
-	size = SKB_DATA_ALIGN(size);
-	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
 	data = kmalloc_node_track_caller(size, gfp_mask, node);
 	if (!data)
 		goto nodata;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/6] net: add support for per-paged-fragment destructors
  2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
  2012-01-05 17:13 ` [PATCH 1/6] net: pack skb_shared_info more efficiently Ian Campbell
  2012-01-05 17:13 ` [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
@ 2012-01-05 17:13 ` Ian Campbell
  2012-01-06  1:15   ` Michał Mirosław
  2012-01-05 17:13 ` [PATCH 4/6] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:13 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Ian Campbell, Michał Mirosław

Entities which care about the complete lifecycle of pages which they inject
into the network stack via an skb paged fragment can choose to set this
destructor in order to receive a callback when the stack is really finished
with a page (including all clones, retransmits, pull-ups etc etc).

This destructor will always be propagated alongside the struct page when
copying skb_frag_t->page. This is the reason I chose to embed the destructor in
a "struct { } page" within the skb_frag_t, rather than as a separate field,
since it allows existing code which propagates ->frags[N].page to Just
Work(tm).

When the destructor is present the page reference counting is done
slightly differently. No references are held by the network stack on
the struct page (it is up to the caller to manage this as necessary);
instead the network stack tracks references via the count embedded in
the destructor structure. When this reference count reaches zero the
destructor is called and the caller can take the necessary steps to
release the page (i.e. release the struct page reference itself).

The intention is that callers can use this callback to delay completion to
_their_ callers until the network stack has completely released the page, in
order to prevent use-after-free or modification of data pages which are still
in use by the stack.

It is allowable (indeed expected) for a caller to share a single
destructor instance between multiple pages injected into the stack,
e.g. a group of pages included in a single higher level operation
might share a destructor which is used to complete that higher level
operation.
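
As a purely illustrative sketch (not part of this patch; all of the
my_* names are hypothetical), a caller might wire the API up along
these lines:

#include <linux/skbuff.h>
#include <linux/completion.h>

struct my_request {
	struct skb_frag_destructor destroy;
	struct completion done;
};

static int my_pages_released(void *data)
{
	struct my_request *req = data;

	/* the stack holds no further references to our pages */
	complete(&req->done);
	return 0;
}

static void my_request_init(struct my_request *req)
{
	atomic_set(&req->destroy.ref, 1);	/* the caller's own reference */
	req->destroy.destroy = my_pages_released;
	req->destroy.data = req;
	init_completion(&req->done);
}

static void my_queue_page(struct sk_buff *skb, struct my_request *req,
			  struct page *page, int off, int len)
{
	int i = skb_shinfo(skb)->nr_frags;

	skb_fill_page_desc(skb, i, page, off, len);
	skb_frag_set_destructor(skb, i, &req->destroy);
	skb_frag_ref(skb, i);	/* counts on req->destroy.ref, not the page */
}

Once everything has been queued the caller drops its own reference
with skb_frag_destructor_unref(&req->destroy) and can then wait on
req->done; the sunrpc patch later in the series follows essentially
this pattern.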

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/skbuff.c      |   17 +++++++++++++++++
 2 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f04333b..89fa908 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -143,9 +143,16 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(void *data);
+	void *data;
+};
+
 struct skb_frag_struct {
 	struct {
 		struct page *p;
+		struct skb_frag_destructor *destructor;
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
@@ -1172,6 +1179,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 }
 
 /**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+					   struct skb_frag_destructor *destroy)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+	frag->page.destructor = destroy;
+}
+
+/**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
@@ -1190,6 +1222,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page.p		  = page;
+	frag->page.destructor     = NULL;
 	frag->page_offset	  = off;
 	skb_frag_size_set(frag, size);
 }
@@ -1684,6 +1717,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
 	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
  * @frag: the paged fragment
@@ -1692,6 +1728,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_ref(frag->page.destructor);
+		return;
+	}
 	get_page(skb_frag_page(frag));
 }
 
@@ -1715,6 +1755,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
  */
 static inline void __skb_frag_unref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_unref(frag->page.destructor);
+		return;
+	}
 	put_page(skb_frag_page(frag));
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6de604..12f3c8c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -351,6 +351,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
 }
 EXPORT_SYMBOL(dev_alloc_skb);
 
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+	BUG_ON(destroy == NULL);
+	atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+	if (destroy == NULL)
+		return;
+
+	if (atomic_dec_and_test(&destroy->ref))
+		destroy->destroy(destroy->data);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/6] net: only allow paged fragments with the same destructor to be coalesced.
  2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
                   ` (2 preceding siblings ...)
  2012-01-05 17:13 ` [PATCH 3/6] net: add support for per-paged-fragment destructors Ian Campbell
@ 2012-01-05 17:13 ` Ian Campbell
  2012-01-05 17:13 ` [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage Ian Campbell
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:13 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Ian Campbell, Alexey Kuznetsov,
	Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Michał Mirosław

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |    7 +++++--
 net/core/skbuff.c      |    1 +
 net/ipv4/ip_output.c   |    2 +-
 net/ipv4/tcp.c         |    4 ++--
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 89fa908..cbc815c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1956,13 +1956,16 @@ static inline int skb_add_data(struct sk_buff *skb,
 }
 
 static inline int skb_can_coalesce(struct sk_buff *skb, int i,
-				   const struct page *page, int off)
+				   const struct page *page,
+				   const struct skb_frag_destructor *destroy,
+				   int off)
 {
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
 		return page == skb_frag_page(frag) &&
-		       off == frag->page_offset + skb_frag_size(frag);
+		       off == frag->page_offset + skb_frag_size(frag) &&
+		       frag->page.destructor == destroy;
 	}
 	return 0;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 12f3c8c..73376b5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2324,6 +2324,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 */
 	if (!to ||
 	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+			      fragfrom->page.destructor,
 			      fragfrom->page_offset)) {
 		merge = -1;
 	} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ff302bd..9e4eca6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1243,7 +1243,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, offset)) {
+		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
 			get_page(page);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9bcdec3..a928aec 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -802,7 +802,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -1011,7 +1011,7 @@ new_segment:
 
 				off = sk->sk_sndmsg_off;
 
-				if (skb_can_coalesce(skb, i, page, off) &&
+				if (skb_can_coalesce(skb, i, page, NULL, off) &&
 				    off != PAGE_SIZE) {
 					/* We can extend the last page
 					 * fragment. */
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage.
  2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
                   ` (3 preceding siblings ...)
  2012-01-05 17:13 ` [PATCH 4/6] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
@ 2012-01-05 17:13 ` Ian Campbell
  2012-01-05 19:15   ` David Miller
       [not found] ` <1325783399.25206.413.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
  2012-01-05 17:27 ` [PATCH 0/6 v2] skb paged fragment destructors David Laight
  6 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:13 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Ian Campbell, Alexey Kuznetsov,
	Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Trond Myklebust, Greg Kroah-Hartman, drbd-user, devel,
	cluster-devel, ocfs2-devel, ceph-devel, rds-devel, linux-nfs

This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.
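
As an illustration (hypothetical caller; every call site touched by
this patch still passes NULL), a user wanting completion notification
would simply thread its destructor through the extended signature:

	/* "req->destroy" is assumed to be a struct skb_frag_destructor
	 * owned by the caller; patch 6 does exactly this for sunrpc. */
	ret = kernel_sendpage(sock, page, &req->destroy,
			      offset, len, MSG_DONTWAIT | MSG_MORE);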

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: drbd-user@lists.linbit.com
Cc: devel@driverdev.osuosl.org
Cc: cluster-devel@redhat.com
Cc: ocfs2-devel@oss.oracle.com
Cc: netdev@vger.kernel.org
Cc: ceph-devel@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-nfs@vger.kernel.org
---
 drivers/block/drbd/drbd_main.c           |    1 +
 drivers/scsi/iscsi_tcp.c                 |    4 ++--
 drivers/scsi/iscsi_tcp.h                 |    3 ++-
 drivers/staging/pohmelfs/trans.c         |    3 ++-
 drivers/target/iscsi/iscsi_target_util.c |    3 ++-
 fs/dlm/lowcomms.c                        |    4 ++--
 fs/ocfs2/cluster/tcp.c                   |    1 +
 include/linux/net.h                      |    6 +++++-
 include/net/inet_common.h                |    4 +++-
 include/net/ip.h                         |    4 +++-
 include/net/sock.h                       |    8 +++++---
 include/net/tcp.h                        |    4 +++-
 net/ceph/messenger.c                     |    2 +-
 net/core/sock.c                          |    6 +++++-
 net/ipv4/af_inet.c                       |    9 ++++++---
 net/ipv4/ip_output.c                     |    6 ++++--
 net/ipv4/tcp.c                           |   24 ++++++++++++++++--------
 net/ipv4/udp.c                           |   11 ++++++-----
 net/ipv4/udp_impl.h                      |    5 +++--
 net/rds/tcp_send.c                       |    1 +
 net/socket.c                             |   11 +++++++----
 net/sunrpc/svcsock.c                     |    6 +++---
 net/sunrpc/xprtsock.c                    |    2 +-
 23 files changed, 84 insertions(+), 44 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 0358e55..49c7346 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
 	set_fs(KERNEL_DS);
 	do {
 		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+							NULL,
 							offset, len,
 							msg_flags);
 		if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 7c34d8e..3884ae1 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
 		if (!segment->data) {
 			sg = segment->sg;
 			offset += segment->sg_offset + sg->offset;
-			r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
-						  copy, flags);
+			r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+						  offset, copy, flags);
 		} else {
 			struct msghdr msg = { .msg_flags = flags };
 			struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
 	uint32_t		sendpage_failures_cnt;
 	uint32_t		discontiguous_hdr_cnt;
 
-	ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+	ssize_t (*sendpage)(struct socket *, struct page *,
+			    struct skb_frag_destructor *, int, size_t, int);
 };
 
 struct iscsi_sw_tcp_host {
diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index 06c1a74..96a7921 100644
--- a/drivers/staging/pohmelfs/trans.c
+++ b/drivers/staging/pohmelfs/trans.c
@@ -104,7 +104,8 @@ static int netfs_trans_send_pages(struct netfs_trans *t, struct netfs_state *st)
 		msg.msg_flags = MSG_WAITALL | (attached_pages == 1 ? 0 :
 				MSG_MORE);
 
-		err = kernel_sendpage(st->socket, page, 0, size, msg.msg_flags);
+		err = kernel_sendpage(st->socket, page, NULL,
+				      0, size, msg.msg_flags);
 		if (err <= 0) {
 			printk("%s: %d/%d failed to send transaction page: t: %p, gen: %u, size: %u, err: %d.\n",
 					__func__, i, t->page_num, t, t->gen, size, err);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 02348f7..d0c984b 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1315,7 +1315,8 @@ send_hdr:
 		u32 sub_len = min_t(u32, data_len, space);
 send_pg:
 		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+					sg_page(sg), NULL,
+					sg->offset + offset, sub_len, 0);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tcp_sendpage() returned"
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 0b3109e..622107a 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1342,8 +1342,8 @@ static void send_to_sock(struct connection *con)
 
 		ret = 0;
 		if (len) {
-			ret = kernel_sendpage(con->sock, e->page, offset, len,
-					      msg_flags);
+			ret = kernel_sendpage(con->sock, e->page, NULL,
+					      offset, len, msg_flags);
 			if (ret == -EAGAIN || ret == 0) {
 				if (ret == -EAGAIN &&
 				    test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) &&
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 044e7b5..e13851e 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -983,6 +983,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		mutex_lock(&sc->sc_send_lock);
 		ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
 						 virt_to_page(kmalloced_virt),
+						 NULL,
 						 (long)kmalloced_virt & ~PAGE_MASK,
 						 size, MSG_DONTWAIT);
 		mutex_unlock(&sc->sc_send_lock);
diff --git a/include/linux/net.h b/include/linux/net.h
index b299230..db562ba 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct skb_frag_destructor;
 
 struct proto_ops {
 	int		family;
@@ -203,6 +204,7 @@ struct proto_ops {
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
+				      struct skb_frag_destructor *destroy,
 				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
@@ -273,7 +275,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
 			     char *optval, int *optlen);
 extern int kernel_setsockopt(struct socket *sock, int level, int optname,
 			     char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+			   struct skb_frag_destructor *destroy,
+			   int offset,
 			   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..91cd8d0 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
 extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size);
-extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+extern ssize_t inet_sendpage(struct socket *sock, struct page *page,
+			     struct skb_frag_destructor *frag,
+			     int offset,
 			     size_t size, int flags);
 extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size, int flags);
diff --git a/include/net/ip.h b/include/net/ip.h
index 775009f..33ea0da 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -114,7 +114,9 @@ extern int		ip_append_data(struct sock *sk, struct flowi4 *fl4,
 				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
-extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4,
+				struct page *page,
+				struct skb_frag_destructor *destroy,
 				int offset, size_t size, int flags);
 extern struct sk_buff  *__ip_make_skb(struct sock *sk,
 				      struct flowi4 *fl4,
diff --git a/include/net/sock.h b/include/net/sock.h
index bb972d2..e13a512 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -801,6 +801,7 @@ struct proto {
 					size_t len, int noblock, int flags, 
 					int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
+					struct skb_frag_destructor *destroy,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
 					struct sockaddr *uaddr, int addr_len);
@@ -1422,9 +1423,10 @@ extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
 extern ssize_t			sock_no_sendpage(struct socket *sock,
-						struct page *page,
-						int offset, size_t size, 
-						int flags);
+					struct page *page,
+					struct skb_frag_destructor *destroy,
+					int offset, size_t size,
+					int flags);
 
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0118ea9..147a268 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -322,7 +322,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk);
 extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t size);
-extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+extern int tcp_sendpage(struct sock *sk, struct page *page,
+			struct skb_frag_destructor *destroy,
+			int offset,
 			size_t size, int flags);
 extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ad5b708..69f049b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -851,7 +851,7 @@ static int write_partial_msg_pages(struct ceph_connection *con)
 				cpu_to_le32(crc32c(tmpcrc, base, len));
 			con->out_msg_pos.did_page_crc = 1;
 		}
-		ret = kernel_sendpage(con->sock, page,
+		ret = kernel_sendpage(con->sock, page, NULL,
 				      con->out_msg_pos.page_pos + page_shift,
 				      len,
 				      MSG_DONTWAIT | MSG_NOSIGNAL |
diff --git a/net/core/sock.c b/net/core/sock.c
index 002939c..563401b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1939,7 +1939,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *
 }
 EXPORT_SYMBOL(sock_no_mmap);
 
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
+ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
+			 struct skb_frag_destructor *destroy,
+			 int offset, size_t size, int flags)
 {
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
@@ -1949,6 +1951,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
 	iov.iov_len = size;
 	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
 	kunmap(page);
+	/* kernel_sendmsg copies so we can destroy immediately */
+	skb_frag_destructor_unref(destroy);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f7b5670..591665d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -745,7 +745,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage(struct socket *sock, struct page *page,
+		      struct skb_frag_destructor *destroy,
+		      int offset,
 		      size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
@@ -758,8 +760,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 		return -EAGAIN;
 
 	if (sk->sk_prot->sendpage)
-		return sk->sk_prot->sendpage(sk, page, offset, size, flags);
-	return sock_no_sendpage(sock, page, offset, size, flags);
+		return sk->sk_prot->sendpage(sk, page, destroy,
+					     offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(inet_sendpage);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9e4eca6..2ce0b8e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1130,6 +1130,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 }
 
 ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+		       struct skb_frag_destructor *destroy,
 		       int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -1243,11 +1244,12 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
+		if (skb_can_coalesce(skb, i, page, destroy, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, len);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		} else {
 			err = -EMSGSIZE;
 			goto error;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a928aec..edd1695 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -755,7 +755,10 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
+static ssize_t do_tcp_sendpages(struct sock *sk,
+				struct page **pages,
+				struct skb_frag_destructor **destructors,
+				int poffset,
 			 size_t psize, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -781,6 +784,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	while (psize > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
 		struct page *page = pages[poffset / PAGE_SIZE];
+		struct skb_frag_destructor *destroy =
+			destructors ? destructors[poffset / PAGE_SIZE] : NULL;
 		int copy, i, can_coalesce;
 		int offset = poffset % PAGE_SIZE;
 		int size = min_t(size_t, psize, PAGE_SIZE - offset);
@@ -802,7 +807,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -813,8 +818,9 @@ new_segment:
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 		} else {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		}
 
 		skb->len += copy;
@@ -869,18 +875,20 @@ out_err:
 	return sk_stream_error(sk, flags, err);
 }
 
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int tcp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	ssize_t res;
 
 	if (!(sk->sk_route_caps & NETIF_F_SG) ||
 	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-		return sock_no_sendpage(sk->sk_socket, page, offset, size,
-					flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, &page, &destroy,
+			       offset, size, flags);
 	release_sock(sk);
 	return res;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5d075b5..c596d36 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1029,8 +1029,9 @@ do_confirm:
 }
 EXPORT_SYMBOL(udp_sendmsg);
 
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int udp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct udp_sock *up = udp_sk(sk);
@@ -1058,11 +1059,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	}
 
 	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
+			     page, destroy, offset, size, flags);
 	if (ret == -EOPNOTSUPP) {
 		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 	}
 	if (ret < 0) {
 		udp_flush_pending_frames(sk);
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index aaad650..4923d82 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,8 +23,9 @@ extern int	compat_udp_getsockopt(struct sock *sk, int level, int optname,
 #endif
 extern int	udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    size_t len, int noblock, int flags, int *addr_len);
-extern int	udp_sendpage(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags);
+extern int	udp_sendpage(struct sock *sk, struct page *page,
+			     struct skb_frag_destructor *destroy,
+			     int offset, size_t size, int flags);
 extern int	udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb);
 extern void	udp_destroy_sock(struct sock *sk);
 
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 1b4fd68..71503ad 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	while (sg < rm->data.op_nents) {
 		ret = tc->t_sock->ops->sendpage(tc->t_sock,
 						sg_page(&rm->data.op_sg[sg]),
+						NULL,
 						rm->data.op_sg[sg].offset + off,
 						rm->data.op_sg[sg].length - off,
 						MSG_DONTWAIT|MSG_NOSIGNAL);
diff --git a/net/socket.c b/net/socket.c
index e62b4f0..9dd6458 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -815,7 +815,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	if (more)
 		flags |= MSG_MORE;
 
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return kernel_sendpage(sock, page, NULL, offset, size, flags);
 }
 
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3372,15 +3372,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(kernel_setsockopt);
 
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+		    struct skb_frag_destructor *destroy,
+		    int offset,
 		    size_t size, int flags)
 {
 	sock_update_classid(sock->sk);
 
 	if (sock->ops->sendpage)
-		return sock->ops->sendpage(sock, page, offset, size, flags);
+		return sock->ops->sendpage(sock, page, destroy,
+					   offset, size, flags);
 
-	return sock_no_sendpage(sock, page, offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(kernel_sendpage);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4653286..0e6b919 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -185,7 +185,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	/* send head */
 	if (slen == xdr->head[0].iov_len)
 		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
+	len = kernel_sendpage(sock, headpage, NULL, headoffset,
 				  xdr->head[0].iov_len, flags);
 	if (len != xdr->head[0].iov_len)
 		goto out;
@@ -198,7 +198,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
@@ -212,7 +212,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 
 	/* send tail */
 	if (xdr->tail[0].iov_len) {
-		result = kernel_sendpage(sock, tailpage, tailoffset,
+		result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
 				   xdr->tail[0].iov_len, 0);
 		if (result > 0)
 			len += result;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 55472c4..38b2fec 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/6] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
       [not found] ` <1325783399.25206.413.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2012-01-05 17:13   ` Ian Campbell
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 17:13 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David Miller, Eric Dumazet, Ian Campbell, Neil Brown,
	J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.
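
In outline, for the normal (reply expected) path the patch below
arranges:

	rpc_xdr_encode()	  req->rq_snd_buf.destructor = &req->destructor
	xs_send_pagedata()	  sendpage(..., xdr->destructor, ...)
	call_decode()		  task sleeps on xprt->pending and drops
				  its own destructor reference
	  ...network stack releases its last fragment reference...
	xprt_complete_skb_pages() wakes the task, which runs
				  call_complete() and only then exits,
				  so the write() caller cannot proceed
				  while the pages are still referenced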

Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Acked-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>
Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    3 ++-
 net/sunrpc/xprt.c           |   13 +++++++++++++
 net/sunrpc/xprtsock.c       |    3 ++-
 6 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index a20970e..172f81e 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 15518a1..75131eb 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index f0268ea..06f363f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1115,6 +1116,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1278,6 +1281,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1311,8 +1315,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1385,7 +1389,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1425,7 +1430,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
@@ -1591,12 +1595,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1611,6 +1617,17 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	dprintk("RPC: %5u call_complete result %d\n",
+		task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0e6b919..aa7f108 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor,
+					 base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index c64c0ef..5f28bc7 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1101,6 +1101,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(void *calldata)
+{
+	struct rpc_task *task = calldata;
+	struct rpc_rqst	*req = task->tk_rqstp;
+
+	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1113,6 +1123,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
+	req->destructor.data = task;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 38b2fec..5406977 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+					  base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* RE: [PATCH 0/6 v2] skb paged fragment destructors
  2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
                   ` (5 preceding siblings ...)
       [not found] ` <1325783399.25206.413.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2012-01-05 17:27 ` David Laight
  2012-01-05 17:34   ` David Miller
  6 siblings, 1 reply; 27+ messages in thread
From: David Laight @ 2012-01-05 17:27 UTC (permalink / raw)
  To: Ian Campbell, David Miller; +Cc: netdev, Eric Dumazet

 
> The mail at [0] contains some more background and rationale but
> basically the completed series will allow entities which inject pages
> into the networking stack to receive a notification when the stack has
> really finished with those pages (i.e. including retransmissions,
> clones, pull-ups etc) and not just when the original skb is finished
> with, which is beneficial to many subsystems which wish to inject
> pages into the network stack without giving up full ownership of those
> page's lifecycle. It implements something broadly along the lines of
> what was described in [1].

If you are doing that, then the network drivers must ensure they
free such skbs as soon as the transmit completes - rather than
deferring the 'end of tx' processing until some later time.

Without that the sending code can't reuse the 'page' for another
request - which is one of the things I presume this allows.

(We had issues with this on SVR4 with STREAMS buffers allocated
with esballoc()...)

	David

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6 v2] skb paged fragment destructors
  2012-01-05 17:27 ` [PATCH 0/6 v2] skb paged fragment destructors David Laight
@ 2012-01-05 17:34   ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2012-01-05 17:34 UTC (permalink / raw)
  To: David.Laight; +Cc: Ian.Campbell, netdev, eric.dumazet

From: "David Laight" <David.Laight@ACULAB.COM>
Date: Thu, 5 Jan 2012 17:27:31 -0000

> If you are doing that, then the network drivers must ensure they
> free such skb as soon as the transmit completes - rather than
> deferring the 'end of tx' processing until some later time.
> 
> Without that the sending code can't reuse the 'page' for another
> request - which is one of the things I presume this allows.

TCP breaks already if a driver defers indefinitely.  All drivers
absolutely must perform TX reclaim in a small, finite, amount of
time.

> (We had issues with this on SVR4 with STREAMS buffers allocated
> with esballoc()...)

This isn't the Mentat stack... so it's extremely unwise to try and
match up concerns you've had in that stack over here when evaluating
changes to Linux.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 17:13 ` [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
@ 2012-01-05 17:46   ` Eric Dumazet
  2012-01-05 19:13     ` David Miller
  2012-01-05 19:19     ` Ian Campbell
  0 siblings, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-01-05 17:46 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller

Le jeudi 05 janvier 2012 à 17:13 +0000, Ian Campbell a écrit :
> We already push the shinfo to the very end of the actual allocation which is
> already cache-line aligned.
> 
> This reduces the minimum overhead required for this allocation such that the
> shinfo can be grown in the following patch without overflowing 2048 bytes for a
> 1500 byte frame. Reducing this overhead means that sometimes the tail end of
> the data can end up in the same cache line as the beginning of the shinfo but
> in many cases the allocation slop means that there is no overlap.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  include/linux/skbuff.h |    4 ++--
>  net/core/skbuff.c      |    3 +--
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 50db9b0..f04333b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -41,7 +41,7 @@
>  #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
>  				 ~(SMP_CACHE_BYTES - 1))
>  #define SKB_WITH_OVERHEAD(X)	\
> -	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +	((X) - sizeof(struct skb_shared_info))
>  #define SKB_MAX_ORDER(X, ORDER) \
>  	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
>  #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
> @@ -50,7 +50,7 @@
>  /* return minimum truesize of one skb containing X bytes of data */
>  #define SKB_TRUESIZE(X) ((X) +						\
>  			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
> -			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +			 sizeof(struct skb_shared_info))
>  
>  /* A. Checksumming of received packets by device.
>   *
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index da0c97f..b6de604 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>  	 * Both skb->head and skb_shared_info are cache line aligned.
>  	 */
> -	size = SKB_DATA_ALIGN(size);
> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
>  	data = kmalloc_node_track_caller(size, gfp_mask, node);
>  	if (!data)
>  		goto nodata;

Unfortunately this means the frequently used part of skb_shared_info
no longer fits in a single cache line.

This will slow down many operations like kfree_skb() of cold skbs (TX
completion, for example).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] net: pack skb_shared_info more efficiently
  2012-01-05 17:13 ` [PATCH 1/6] net: pack skb_shared_info more efficiently Ian Campbell
@ 2012-01-05 18:30   ` Eric Dumazet
  2012-01-05 19:09     ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-01-05 18:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller

Le jeudi 05 janvier 2012 à 17:13 +0000, Ian Campbell a écrit :
> nr_frags can be 8 bits since 256 is plenty of fragments. This allows it to be
> packed with tx_flags.
> 
> Also by moving ip6_frag_id and dataref (both 4 bytes) next to each other we can
> avoid a hole between ip6_frag_id and frag_list on 64 bit systems.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---


Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] net: pack skb_shared_info more efficiently
  2012-01-05 18:30   ` Eric Dumazet
@ 2012-01-05 19:09     ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2012-01-05 19:09 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ian.campbell, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 05 Jan 2012 19:30:58 +0100

> Le jeudi 05 janvier 2012 à 17:13 +0000, Ian Campbell a écrit :
>> nr_frags can be 8 bits since 256 is plenty of fragments. This allows it to be
>> packed with tx_flags.
>> 
>> Also by moving ip6_frag_id and dataref (both 4 bytes) next to each other we can
>> avoid a hole between ip6_frag_id and frag_list on 64 bit systems.
>> 
>> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
>> ---
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 17:46   ` Eric Dumazet
@ 2012-01-05 19:13     ` David Miller
  2012-01-05 20:00       ` Ian Campbell
  2012-01-05 19:19     ` Ian Campbell
  1 sibling, 1 reply; 27+ messages in thread
From: David Miller @ 2012-01-05 19:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ian.campbell, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 05 Jan 2012 18:46:47 +0100

>> @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>>  	 * Both skb->head and skb_shared_info are cache line aligned.
>>  	 */
>> -	size = SKB_DATA_ALIGN(size);
>> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
>>  	data = kmalloc_node_track_caller(size, gfp_mask, node);
>>  	if (!data)
>>  		goto nodata;
> 
> Unfortunately this means the frequently used part of skb_shared_info no
> longer fits in a single cache line.
> 
> This will slow down many operations, like kfree_skb() of cold skbs (TX
> completion, for example).

My concerns match Eric's here.

Ian, you state that you attempted to look into schemes that put the
destructor info somewhere other than skb_shared_info() and that none
of them worked well.

But what kind of issues do you run into if you add this into struct
page itself?  Other subsystems could then use this facility too, and
if I recall there was even some talk of this.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage.
  2012-01-05 17:13 ` [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage Ian Campbell
@ 2012-01-05 19:15   ` David Miller
  2012-01-05 20:04     ` Ian Campbell
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2012-01-05 19:15 UTC (permalink / raw)
  To: ian.campbell
  Cc: devel, rds-devel, pekkas, eric.dumazet, yoshfuji, netdev, gregkh,
	Trond.Myklebust, jmorris, cluster-devel, ocfs2-devel, kuznet,
	ceph-devel, linux-nfs, kaber, drbd-user

From: Ian Campbell <ian.campbell@citrix.com>
Date: Thu, 5 Jan 2012 17:13:43 +0000

> -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
> +static ssize_t do_tcp_sendpages(struct sock *sk,
> +				struct page **pages,
> +				struct skb_frag_destructor **destructors,
> +				int poffset,
>  			 size_t psize, int flags)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);

An array of destructors is madness, and the one call site that specifies this
passes an address of a single entry.

This also would never even have to occur if you put the destructor inside of
struct page instead.

Finally, except for the skb_shared_info() layout optimization in patch #1, which
I already applied, this stuff is not baked enough for the 3.3 merge window.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 17:46   ` Eric Dumazet
  2012-01-05 19:13     ` David Miller
@ 2012-01-05 19:19     ` Ian Campbell
  2012-01-05 20:22       ` Eric Dumazet
  1 sibling, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 19:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller

On Thu, 2012-01-05 at 17:46 +0000, Eric Dumazet wrote:
> Le jeudi 05 janvier 2012 à 17:13 +0000, Ian Campbell a écrit :
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index da0c97f..b6de604 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> >  	 * Both skb->head and skb_shared_info are cache line aligned.
> >  	 */
> > -	size = SKB_DATA_ALIGN(size);
> > -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
> >  	data = kmalloc_node_track_caller(size, gfp_mask, node);
> >  	if (!data)
> >  		goto nodata;
> 
> Unfortunately this means the frequently used part of skb_shared_info no
> longer fits in a single cache line.
> 
> This will slow down many operations, like kfree_skb() of cold skbs (TX
> completion, for example).

I've been toying with the following which moves the infrequently used
destructor_arg field to the front. With 32 or 64 byte cache lines this
is then the only field which ends up on a different cache line. With 128
byte cache lines it is all already on the same line.

Does that alleviate your concern? If so I can clean it up and include it
in the series.

Cheers,
Ian.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cbc815c..533c2ea 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -249,6 +249,11 @@ struct ubuf_info {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void *		destructor_arg;
+
+	/* Warning all fields from here until dataref are cleared in __alloc_skb() */
 	unsigned char	nr_frags;
 	__u8		tx_flags;
 	unsigned short	gso_size;
@@ -264,9 +269,6 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void *		destructor_arg;
 
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e942a86..592b8a8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -219,7 +219,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+
+	memset(&shinfo->nr_frags, 0,
+	       offsetof(struct skb_shared_info, dataref)
+	       - offsetof(struct skb_shared_info, nr_frags));
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 19:13     ` David Miller
@ 2012-01-05 20:00       ` Ian Campbell
  2012-01-07 18:18         ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 20:00 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Thu, 2012-01-05 at 19:13 +0000, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 05 Jan 2012 18:46:47 +0100
> 
> >> @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> >>  	 * Both skb->head and skb_shared_info are cache line aligned.
> >>  	 */
> >> -	size = SKB_DATA_ALIGN(size);
> >> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >> +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
> >>  	data = kmalloc_node_track_caller(size, gfp_mask, node);
> >>  	if (!data)
> >>  		goto nodata;
> > 
> > Unfortunately this means the frequently used part of skb_shared_info no
> > longer fits in a single cache line.
> > 
> > This will slow down many operations, like kfree_skb() of cold skbs (TX
> > completion, for example).
> 
> My concerns match Eric's here.
> 
> Ian, you state that you attempted to look into schemes that put the
> destructor info somewhere other than skb_shared_info() and that none
> of them worked well.
> 
> But what kind of issues do you run into if you add this into struct
> page itself?  Other subsystems could then use this facility too, and
> if I recall there was even some talk of this.

This is what the old out of tree Xen stuff used to do and it was my
understanding that it had been nacked by the mm guys, although it was a
long while ago and I wasn't really involved at the time. I think space
in struct page is pretty tight and any addition is multiplied over the
number of pages in the system which makes it undesirable. AIUI there
were also complaints about the addition of a hook on the page free path,
which is quite a hot one.

I could run it by them again but...

Although this scheme works fine for Xen's netback I don't think it works
for other use cases like the NFS one (or basically any kernel_sendpage
usage). In those cases you don't want to wait for the last core mm ref
on the page to get dropped, you just want to wait for the last ref due
to the particular sendpage. If you use the core page_count() reference
count then you end up waiting for the process to exit (and drop the core
reference count) before the destructor fires and you can complete the
write, which is obviously not what is desired!

There are also issues with things like two threads simultaneously doing
I/O from the same page. If one lot of I/O is to NFS and the other to
iSCSI (assuming they both use this facility in the future) then they
will clash over the use of the struct page field. In fact even if they
were both to NFS I bet nothing good would happen...
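
(Editorial sketch, not code from the series: the contrast being drawn
here, expressed in terms of the skb_frag_destructor structure from patch
3/6 quoted later in this thread. The helper below and its name are
hypothetical.)

/* Hypothetical helper, for illustration only: completion is tied to a
 * refcount owned by this particular sendpage operation rather than to
 * the page's global page_count(), so the callback can fire as soon as
 * the network stack is finished with the fragment even though the
 * process still holds a reference to the page itself.
 */
static void frag_destructor_put(struct skb_frag_destructor *d)
{
	if (atomic_dec_and_test(&d->ref))
		d->destroy(d->data);
}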

Cheers,
Ian.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage.
  2012-01-05 19:15   ` David Miller
@ 2012-01-05 20:04     ` Ian Campbell
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-05 20:04 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, eric.dumazet, kuznet, pekkas, jmorris, yoshfuji, kaber,
	Trond.Myklebust, gregkh, drbd-user, devel, cluster-devel,
	ocfs2-devel, ceph-devel, rds-devel, linux-nfs

On Thu, 2012-01-05 at 19:15 +0000, David Miller wrote:
> From: Ian Campbell <ian.campbell@citrix.com>
> Date: Thu, 5 Jan 2012 17:13:43 +0000
> 
> > -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
> > +static ssize_t do_tcp_sendpages(struct sock *sk,
> > +				struct page **pages,
> > +				struct skb_frag_destructor **destructors,
> > +				int poffset,
> >  			 size_t psize, int flags)
> >  {
> >  	struct tcp_sock *tp = tcp_sk(sk);
> 
> An array of destructors is madness, and the one call site that specifies this
> passes an address of a single entry.

I figured it was easy enough to accommodate the multiple destructor case,
but you are right that it is overkill given the current (and, realistically,
expected) usage; I'll change that for the next round.

(that's assuming we don't end up with some scheme where the struct page
* is in the destructor struct like I was investigating previously to
alleviate the frag size overhead. I guess this illustrates nicely why
that approach got ugly: these arrays propagate all the way up the call
chain if you do that)

> This also would never even have to occur if you put the destructor inside of
> struct page instead.
> 
> Finally, except for the skb_shared_info() layout optimization in patch #1, which
> I already applied, this stuff is not baked enough for the 3.3 merge window.

Sure thing, I should have made it clear in my intro mail that I was
aiming for 3.4.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 19:19     ` Ian Campbell
@ 2012-01-05 20:22       ` Eric Dumazet
  2012-01-06 11:20         ` Ian Campbell
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-01-05 20:22 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller

Le jeudi 05 janvier 2012 à 19:19 +0000, Ian Campbell a écrit :
> On Thu, 2012-01-05 at 17:46 +0000, Eric Dumazet wrote:
> > Le jeudi 05 janvier 2012 à 17:13 +0000, Ian Campbell a écrit :
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index da0c97f..b6de604 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > >  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> > >  	 * Both skb->head and skb_shared_info are cache line aligned.
> > >  	 */
> > > -	size = SKB_DATA_ALIGN(size);
> > > -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
> > >  	data = kmalloc_node_track_caller(size, gfp_mask, node);
> > >  	if (!data)
> > >  		goto nodata;
> > 
> > Unfortunately this means the frequently used part of skb_shared_info no
> > longer fits in a single cache line.
> > 
> > This will slow down many operations, like kfree_skb() of cold skbs (TX
> > completion, for example).
> 
> I've been toying with the following which moves the infrequently used
> destructor_arg field to the front. With 32 or 64 byte cache lines this
> is then the only field which ends up on a different cache line. With 128
> byte cache lines it is all already on the same line.
> 
> Does that alleviate your concern? If so I can clean it up and include it
> in the series.
> 
> Cheers,
> Ian.
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index cbc815c..533c2ea 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -249,6 +249,11 @@ struct ubuf_info {
>   * the end of the header data, ie. at skb->end.
>   */
>  struct skb_shared_info {
> +	/* Intermediate layers must ensure that destructor_arg
> +	 * remains valid until skb destructor */
> +	void *		destructor_arg;
> +
> +	/* Warning all fields from here until dataref are cleared in __alloc_skb() */
>  	unsigned char	nr_frags;
>  	__u8		tx_flags;
>  	unsigned short	gso_size;
> @@ -264,9 +269,6 @@ struct skb_shared_info {
>  	 */
>  	atomic_t	dataref;
>  
> -	/* Intermediate layers must ensure that destructor_arg
> -	 * remains valid until skb destructor */
> -	void *		destructor_arg;
>  
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];

Hmmm

Take a look at kfree_skb() and please list all the skb_shared_info fields
it needs, for a typical skb with nr_frags = 0 or 1.

If they all fit in a single cache line, it's ok.

If not, this adds a cache line miss in a critical path.

The same applies in the transmit path, when an skb is queued by cpu X on a
qdisc and dequeued by another cpu Y to be delivered to the device.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] net: add support for per-paged-fragment destructors
  2012-01-05 17:13 ` [PATCH 3/6] net: add support for per-paged-fragment destructors Ian Campbell
@ 2012-01-06  1:15   ` Michał Mirosław
  2012-01-06  8:50     ` Ian Campbell
  0 siblings, 1 reply; 27+ messages in thread
From: Michał Mirosław @ 2012-01-06  1:15 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Eric Dumazet

On Thu, Jan 05, 2012 at 05:13:41PM +0000, Ian Campbell wrote:
> Entities which care about the complete lifecycle of pages which they inject
> into the network stack via an skb paged fragment can choose to set this
> destructor in order to receive a callback when the stack is really finished
> with a page (including all clones, retransmits, pull-ups etc etc).
[...]
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -143,9 +143,16 @@ struct sk_buff;
>  
>  typedef struct skb_frag_struct skb_frag_t;
>  
> +struct skb_frag_destructor {
> +	atomic_t ref;
> +	int (*destroy)(void *data);
> +	void *data;
> +};
> +
[...]

What happened to removal of 'data' field from this structure?

ref: http://www.spinics.net/lists/netdev/msg179050.html

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] net: add support for per-paged-fragment destructors
  2012-01-06  1:15   ` Michał Mirosław
@ 2012-01-06  8:50     ` Ian Campbell
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-06  8:50 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: netdev, David Miller, Eric Dumazet

On Fri, 2012-01-06 at 01:15 +0000, Michał Mirosław wrote:
> On Thu, Jan 05, 2012 at 05:13:41PM +0000, Ian Campbell wrote:
> > Entities which care about the complete lifecycle of pages which they inject
> > into the network stack via an skb paged fragment can choose to set this
> > destructor in order to receive a callback when the stack is really finished
> > with a page (including all clones, retransmits, pull-ups etc etc).
> [...]
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -143,9 +143,16 @@ struct sk_buff;
> >  
> >  typedef struct skb_frag_struct skb_frag_t;
> >  
> > +struct skb_frag_destructor {
> > +	atomic_t ref;
> > +	int (*destroy)(void *data);
> > +	void *data;
> > +};
> > +
> [...]
> 
> What happened to removal of 'data' field from this structure?

I did that in my branch where I also moved the *page into there (to try
and shrink the frag size) and forgot to rescue it when I switched back.
I'll make the change again for next time.
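
(Editorial sketch, not from the original mails: one way a destructor
without a 'data' field could be used, via the usual container_of pattern;
whether this matches the approach in the referenced message is an
assumption, and all names below are illustrative.)

struct skb_frag_destructor {
	atomic_t ref;
	int (*destroy)(struct skb_frag_destructor *destructor);
};

/* A caller embeds the destructor in its own state and recovers it in
 * the callback, so no separate opaque pointer is needed: */
struct my_tx_state {
	struct skb_frag_destructor destructor;
	/* ... caller-private bookkeeping ... */
};

static int my_frag_destroy(struct skb_frag_destructor *destructor)
{
	struct my_tx_state *state =
		container_of(destructor, struct my_tx_state, destructor);
	/* release whatever state tracks for this I/O */
	return 0;
}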

Thanks for the reminder.

Ian.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 20:22       ` Eric Dumazet
@ 2012-01-06 11:20         ` Ian Campbell
  2012-01-06 12:33           ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-06 11:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller

On Thu, 2012-01-05 at 20:22 +0000, Eric Dumazet wrote:
> Take a look at kfree_skb() and please list all the skb_shared_info fields
> it needs, for a typical skb with nr_frags = 0 or 1.
> 
> If they all fit in a single cache line, it's ok.

It doesn't fit in a single cache line today.

Before I started touching any of this stuff (e.g. in v3.1) kfree_skb
would touch 3x32 byte cache lines (with boundaries between
frag_list/hwtstamps and destructor_arg/frags[0]) or 2x64 byte cache
lines (boundary between frag_list/hwtstamps).

After the v2 version of these patches the situation is much the same,
although the boundaries are different and which field is in which line
has shifted a bit. The boundaries are now gso_type/frag_list and
dataref/destructor_arg for 32 byte cache lines and gso_type/frag_list
for 64 byte cache lines.

With the proposed reordering to put destructor_arg first it actually
shrinks to just two 32 byte cache lines (boundary between hwtstamps and
ip6_frag_id) or a single 64 byte cache line, since the other 32&64
boundary is between destructor_arg and nr_frags.

Even if we don't consider destructor_arg to be rarely used / cold, things
are no worse than they were before (3x32 or 2x64 byte cache lines
touched).

So if anything I think these patches improve things when nr_frags = 0
or 1, and they are at least no worse than the current situation.

> If not, this adds a cache line miss in a critical path.

I think it does not, but please do check my working (see below).

> Also in transmit path, when an skb is queued by cpu X on qdisc, and
> dequeued by another cpu Y to be delivered to device.

You mean the path starting at dev_queue_xmit -> q->enqueue is true ->
__dev_xmit_skb ?

I didn't follow it properly, but it doesn't seem like there is much use
of the shinfo at all in the qdisc area; the only thing I spotted was
the use of the gso_{size,segs} fields in qdisc_bstats_update, and those
are always on the same cache line. Did I miss something?

Anyway, by similar logic to the kfree_skb case I think the number of
cachelines touched will always be <= the current number.

Ian.

My working:

						Used in kfree_skb?
unsigned short  nr_frags;			YES (if dataref -> 0 etc)
unsigned short  gso_size;
unsigned short  gso_segs;
unsigned short  gso_type;
__be32          ip6_frag_id;
__u8            tx_flags;			YES (if dataref -> 0 etc)
struct sk_buff  *frag_list;			YES (if dataref -> 0 etc)
struct skb_shared_hwtstamps hwtstamps;
atomic_t        dataref;			YES
void *          destructor_arg;			YES if skb->destructor || tx_flags & ZEROCOPY
                                                -> I count this as unused by
						   kfree_skb for the purposes here

skb_frag_t      frags[MAX_SKB_FRAGS];		YES (<nr_frags, if dataref->0 etc)

LEGEND:

*<-used in kfree_skb
 N<-offset from start of shinfo		    offset from end of shinfo-> -N

v3.1:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 byte cache line =-=-=-=-=
 -40									-384
--------------------------------------------- 32 byte cache line --------------
 -8									-352
struct skb_shared_info {
*0	unsigned short  nr_frags;					-344
 2	unsigned short  gso_size;					-342
        /* Warning: this field is not always filled in (UFO)! */
 4	unsigned short  gso_segs;					-340
 6	unsigned short  gso_type;					-338
 8	__be32          ip6_frag_id;					-336
*12	__u8            tx_flags;					-332
*16	struct sk_buff  *frag_list;					-328
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 byte cache line =-=-=-=-=
 24	struct skb_shared_hwtstamps hwtstamps;				-320

        /*
         * Warning : all fields before dataref are cleared in __alloc_skb()
         */
*40	atomic_t        dataref;					-304

        /* Intermediate layers must ensure that destructor_arg
         * remains valid until skb destructor */
 48	void *          destructor_arg;					-296

        /* must be last field, see pskb_expand_head() */
--------------------------------------------- 32 byte cache line --------------
*56	skb_frag_t      frags[MAX_SKB_FRAGS];				-16x18 
									= -288
									=  9x32
									or 4.5x64
									or 2.25x128
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 & 128 byte cache line =-=
 344	= 10.75x32 or 5.375x64 or 2.6875x128				-0
};

skb patches v2:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 byte cache line =-=-=-=-=
 -56									-512
--------------------------------------------- 32 byte cache line --------------
 -24									-480
struct skb_shared_info {
*0	unsigned char	nr_frags;					-456
*1	__u8		tx_flags;					-455
 2	unsigned short	gso_size;					-454
	/* Warning: this field is not always filled in (UFO)! */
 4	unsigned short	gso_segs;					-452
 6	unsigned short  gso_type;					-450
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 byte cache line =-=-=-=-=
*8	struct sk_buff	*frag_list;					-448
 16	struct skb_shared_hwtstamps hwtstamps;				-440
 32	__be32          ip6_frag_id;					-424

	/*
	 * Warning : all fields before dataref are cleared in __alloc_skb()
	 */
*36	atomic_t	dataref;					-420

	/* Intermediate layers must ensure that destructor_arg
	 * remains valid until skb destructor */
--------------------------------------------- 32 byte cache line --------------
 40	void *		destructor_arg;					-416

	/* must be last field, see pskb_expand_head() */
*48	skb_frag_t	frags[MAX_SKB_FRAGS];				-24*17
									= -408
									=  12.75x32
									or 6.375x64
									or 3.1875x128
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 & 128 byte cache line =-=
 456	= 14.25x32 or 7.125x64 or 3.5625x128				-0
};

skb patches + put destructor_arg first

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 byte cache line =-=-=-=-=
 -56									-512
--------------------------------------------- 32 byte cache line
 -24									-480
struct skb_shared_info {
	/* Intermediate layers must ensure that destructor_arg
	 * remains valid until skb destructor */
 0	void *		destructor_arg;					-456

	/* Warning all fields from here until dataref are cleared in __alloc_skb() */
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 byte cache line =-=-=-=-=
*8	unsigned char	nr_frags;					-448
*9	__u8		tx_flags;					-447
 10	unsigned short	gso_size;					-446
	/* Warning: this field is not always filled in (UFO)! */
 12	unsigned short	gso_segs;					-444
 14	unsigned short  gso_type;					-442
*16	struct sk_buff	*frag_list;					-440
 24	struct skb_shared_hwtstamps hwtstamps;				-432
--------------------------------------------- 32 byte cache line --------------
 40	__be32          ip6_frag_id;					-416

	/*
	 * Warning : all fields before dataref are cleared in __alloc_skb()
	 */
*44	atomic_t	dataref;					-412

	/* must be last field, see pskb_expand_head() */
*48	skb_frag_t	frags[MAX_SKB_FRAGS];				-24*17
									= -408
									=  12.75x32
									or 6.375x64
									or 3.1875x128
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 32 & 64 & 128 byte cache line =-=
}; 456 = 14.25x32 or 7.125x64 or 3.5625x128

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-06 11:20         ` Ian Campbell
@ 2012-01-06 12:33           ` Eric Dumazet
  2012-01-06 13:20             ` Ian Campbell
  2012-01-06 13:29             ` Ian Campbell
  0 siblings, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2012-01-06 12:33 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller

Le vendredi 06 janvier 2012 à 11:20 +0000, Ian Campbell a écrit :

> It doesn't fit in a single cache line today.

It really does, thanks to your (net: pack skb_shared_info more
efficiently) previous patch.

I don't understand your numbers; they are very hard to read.

Current net-next :

offsetof(struct skb_shared_info, nr_frags)=0x0
offsetof(struct skb_shared_info, frags[1])=0x40   (0x30 on 32bit arches)

So _all_ fields, including one frag, are included in a single cache line
on most machines (64-bytes cache line), IF struct skb_shared_info is
aligned.

Your patch obviously breaks this on 64bit arches, unless you make sure
sizeof(struct skb_shared_info) is a multiple of cache line.

[BTW, it is incidentaly the case after your 1/6 patch]

fields reordering is not going to change anything on this.

Or maybe I misread your patch ?

At least you claimed in Changelog : 

<quote>
 Reducing this overhead means that sometimes the tail end of
 the data can end up in the same cache line as the beginning of the shinfo but
 in many cases the allocation slop means that there is no overlap.
</quote>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-06 12:33           ` Eric Dumazet
@ 2012-01-06 13:20             ` Ian Campbell
  2012-01-06 13:37               ` Eric Dumazet
  2012-01-06 13:29             ` Ian Campbell
  1 sibling, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2012-01-06 13:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller

On Fri, 2012-01-06 at 12:33 +0000, Eric Dumazet wrote:
> Le vendredi 06 janvier 2012 à 11:20 +0000, Ian Campbell a écrit :
> 
> > It doesn't fit in a single cache line today.
> 
> It really does, thanks to your (net: pack skb_shared_info more
> efficiently) previous patch.
> 
> I don't understand your numbers; they are very hard to read.
> 
> Current net-next :
> 
> offsetof(struct skb_shared_info, nr_frags)=0x0
> offsetof(struct skb_shared_info, frags[1])=0x40   (0x30 on 32bit arches)

I see 0x48 here at cset 6386994e03ebbe60338ded3d586308a41e81c0dc:
(gdb) print &((struct skb_shared_info *)0)->nr_frags
$1 = (short unsigned int *) 0x0
(gdb) print &((struct skb_shared_info *)0)->frags[1]
$2 = (skb_frag_t *) 0x48
(gdb) print &((struct skb_shared_info *)0)->frags[0]
$3 = (skb_frag_t *) 0x38

(it's 0x34 and 0x2c on 32 bit) and these numbers match what I posted for
v3.1 (and I imagine earlier since this stuff doesn't seem to change very
often).

I provided the offsets of each field in struct skb_shared_info, which
one do you think is wrong?

Remember that the end of the shared info is explicitly aligned (to the end of
the allocation == a cache line) while the front just ends up wherever
the size dictates, so you can't measure the alignment of the fields from
the beginning of the struct; you need to measure from the end.

Ian.

> So _all_ fields, including one frag, are included in a single cache line
> on most machines (64-bytes cache line), IF struct skb_shared_info is
> aligned.
> 
> Your patch obviously breaks this on 64bit arches, unless you make sure
> sizeof(struct skb_shared_info) is a multiple of cache line.
> 
> [BTW, it is incidentaly the case after your 1/6 patch]
> 
> fields reordering is not going to change anything on this.
> 
> Or maybe I misread your patch ?
> 
> At least you claimed in Changelog : 
> 
> <quote>
>  Reducing this overhead means that sometimes the tail end of
>  the data can end up in the same cache line as the beginning of the shinfo but
>  in many cases the allocation slop means that there is no overlap.
> </quote>
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-06 12:33           ` Eric Dumazet
  2012-01-06 13:20             ` Ian Campbell
@ 2012-01-06 13:29             ` Ian Campbell
  1 sibling, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-06 13:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller

On Fri, 2012-01-06 at 12:33 +0000, Eric Dumazet wrote:
> Le vendredi 06 janvier 2012 à 11:20 +0000, Ian Campbell a écrit :
> 
> > It doesn't fit in a single cache line today.
> 
> It really does, thanks to your (net: pack skb_shared_info more
> efficiently) previous patch.
> 
> I don't understand your numbers; they are very hard to read.
> 
> Current net-next :
> 
> offsetof(struct skb_shared_info, nr_frags)=0x0
> offsetof(struct skb_shared_info, frags[1])=0x40   (0x30 on 32bit arches)
> 
> So _all_ fields, including one frag, are included in a single cache line
> on most machines (64-bytes cache line),

BTW, this is also true with my patch + put destructor_arg first in the
struct (at least for all interesting fields, since I chose
destructor_arg specifically because it did not seem interesting for
these purposes -- do you disagree?)

(gdb) print &((struct skb_shared_info *)0)->frags[1]
$1 = (skb_frag_t *) 0x48
but there is a cacheline boundary just before nr_frags:
(gdb) print &((struct skb_shared_info *)0)->nr_frags
$3 = (unsigned char *) 0x8

So the interesting fields total 0x48-0x8 = 0x40 bytes and the alignment
is such that this is a single cache line.

>  IF struct skb_shared_info is
> aligned.

Obviously the conditions for the above are a little different but they
are, AFAIK, met.

Ian.

> 
> Your patch obviously breaks this on 64bit arches, unless you make sure
> sizeof(struct skb_shared_info) is a multiple of cache line.
> 
> [BTW, it is incidentaly the case after your 1/6 patch]
> 
> fields reordering is not going to change anything on this.
> 
> Or maybe I misread your patch ?
> 
> At least you claimed in Changelog : 
> 
> <quote>
>  Reducing this overhead means that sometimes the tail end of
>  the data can end up in the same cache line as the beginning of the shinfo but
>  in many cases the allocation slop means that there is no overlap.
> </quote>
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-06 13:20             ` Ian Campbell
@ 2012-01-06 13:37               ` Eric Dumazet
  2012-01-06 13:49                 ` Ian Campbell
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2012-01-06 13:37 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller

Le vendredi 06 janvier 2012 à 13:20 +0000, Ian Campbell a écrit :
> On Fri, 2012-01-06 at 12:33 +0000, Eric Dumazet wrote:
> > Le vendredi 06 janvier 2012 à 11:20 +0000, Ian Campbell a écrit :
> > 
> > > It doesn't fit in a single cache line today.
> > 
> > It really does, thanks to your (net: pack skb_shared_info more
> > efficiently) previous patch.
> > 
> > I don't understand your numbers; they are very hard to read.
> > 
> > Current net-next :
> > 
> > offsetof(struct skb_shared_info, nr_frags)=0x0
> > offsetof(struct skb_shared_info, frags[1])=0x40   (0x30 on 32bit arches)
> 
> I see 0x48 here at cset 6386994e03ebbe60338ded3d586308a41e81c0dc:

If you read my mail, I said "current net-next" 

Please "git pull" again ?


Here I really have your commit 9f42f126154786e6e76df513004800c8c633f020
in.



> (gdb) print &((struct skb_shared_info *)0)->nr_frags
> $1 = (short unsigned int *) 0x0
> (gdb) print &((struct skb_shared_info *)0)->frags[1]
> $2 = (skb_frag_t *) 0x48
> (gdb) print &((struct skb_shared_info *)0)->frags[0]
> $3 = (skb_frag_t *) 0x38
> 
> (it's 0x34 and 0x2c on 32 bit) and these numbers match what I posted for
> v3.1 (and I imagine earlier since this stuff doesn't seem to change very
> often).
> 
> I provided the offsets of each field in struct skb_shared_info, which
> one do you think is wrong?
> 
> Remember that the end of the shared info is explicitly aligned (to the end of
> the allocation == a cache line) while the front just ends up wherever
> the size dictates, so you can't measure the alignment of the fields from
> the beginning of the struct; you need to measure from the end.

You're mistaken. Really.

Current code makes SURE skb->end starts at a cache line boundary
(assuming kmalloc() returned an aligned block; this might not be true
with SLAB debugging)

size = SKB_WITH_OVERHEAD(ksize(data));
...
skb->end = skb->tail + size;

So there is a possibility of padding (less than 64 bytes) _after_
skb_shared_info and before the end of the allocated area.

After your 9f42f1261547 commit on 64bit, we have no padding anymore
since sizeof(struct skb_shared_info)=0x140
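
(Editorial worked example, not from the original mails, assuming a
2048-byte kmalloc allocation and 64-byte cache lines: with current
net-next, SKB_WITH_OVERHEAD(ksize(data)) = 2048 - SKB_DATA_ALIGN(0x140)
= 2048 - 320 = 1728, a multiple of 64, so skb_shared_info starts on a
cache line boundary. With patch 2/6 applied and the destructor-enlarged
456-byte shinfo the offset would instead be 2048 - 456 = 1592, which is
not a multiple of 64 -- the loss of alignment under discussion here.)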

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-06 13:37               ` Eric Dumazet
@ 2012-01-06 13:49                 ` Ian Campbell
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2012-01-06 13:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller

On Fri, 2012-01-06 at 13:37 +0000, Eric Dumazet wrote:
> Le vendredi 06 janvier 2012 à 13:20 +0000, Ian Campbell a écrit :
> > On Fri, 2012-01-06 at 12:33 +0000, Eric Dumazet wrote:
> > > Le vendredi 06 janvier 2012 à 11:20 +0000, Ian Campbell a écrit :
> > > 
> > > > It doesn't fit in a single cache line today.
> > > 
> > > It really does, thanks to your (net: pack skb_shared_info more
> > > efficiently) previous patch.
> > > 
> > > I don't understand your numbers; they are very hard to read.
> > > 
> > > Current net-next :
> > > 
> > > offsetof(struct skb_shared_info, nr_frags)=0x0
> > > offsetof(struct skb_shared_info, frags[1])=0x40   (0x30 on 32bit arches)
> > 
> > I see 0x48 here at cset 6386994e03ebbe60338ded3d586308a41e81c0dc:
> 
> If you read my mail, I said "current net-next" 
> 
> Please "git pull" again ?

Oh, oops, I thought I'd checked I was up to date.

> Here I really have your commit 9f42f126154786e6e76df513004800c8c633f020
> in.
> 
> 
> 
> > (gdb) print &((struct skb_shared_info *)0)->nr_frags
> > $1 = (short unsigned int *) 0x0
> > (gdb) print &((struct skb_shared_info *)0)->frags[1]
> > $2 = (skb_frag_t *) 0x48
> > (gdb) print &((struct skb_shared_info *)0)->frags[0]
> > $3 = (skb_frag_t *) 0x38
> > 
> > (it's 0x34 and 0x2c on 32 bit) and these numbers match what I posted for
> > v3.1 (and I imagine earlier since this stuff doesn't seem to change very
> > often).
> > 
> > I provided the offsets of each field in struct skb_shared_info, which
> > one do you think is wrong?
> > 
> > Remember that the end of the shared info is explicitly aligned (to the end of
> > the allocation == a cache line) while the front just ends up wherever
> > the size dictates, so you can't measure the alignment of the fields from
> > the beginning of the struct; you need to measure from the end.
> 
> You're mistaken. Really.
> 
> Current code makes SURE skb->end starts at a cache line boundary
> (assuming kmalloc() returned an aligned block; this might not be true
> with SLAB debugging)
> 
> size = SKB_WITH_OVERHEAD(ksize(data));
> ...
> skb->end = skb->tail + size;
> 
> So there is a possibility of padding (less than 64 bytes) _after_
> skb_shared_info and before the end of the allocated area.

Oh, I somehow missed the SKB_DATA_ALIGN inside SKB_WITH_OVERHEAD when I
was reasoning about the before case (even though I touched it in my
patch!). Thanks for prodding me sufficiently...

I guess the comment preceding "size =" is therefore slightly misleading,
since shinfo isn't "exactly at the end" but rather "at the end but still
aligned"?
	/* kmalloc(size) might give us more room than requested.
	 * Put skb_shared_info exactly at the end of allocated zone,
	 * to allow max possible filling before reallocation.
	 */

Anyhow, despite this I think that with the patch to move destructor_arg to
the front on top of this series the hot fields are all in a single cache
line, including the first frag. Is that sufficient to alleviate your
concerns?

Ian.

> After your 9f42f1261547 commit on 64bit, we have no padding anymore
> since sizeof(struct skb_shared_info)=0x140
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually
  2012-01-05 20:00       ` Ian Campbell
@ 2012-01-07 18:18         ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2012-01-07 18:18 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: eric.dumazet, netdev

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Thu, 5 Jan 2012 20:00:58 +0000

> Although this scheme works fine for Xen's netback I don't think it works
> for other use cases like the NFS one (or basically any kernel_sendpage
> usage). In those cases you don't want to wait for the last core mm ref
> on the page to get dropped, you just want to wait for the last ref due
> to the particular sendpage. If you use the core page_count() reference
> count then you end up waiting for the process to exit (and drop the core
> reference count) before the destructor fires and you can complete the
> write, which is obviously not what is desired!
> 
> There are also issues with things like two threads simultaneously doing
> I/O from the same page. If one lot of I/O is to NFS and the other to
> iSCSI (assuming they both use this facility in the future) then they
> will clash over the use of the struct page field. In fact even if they
> were both to NFS I bet nothing good would happen...

Ok, struct page doesn't seem the best place for this stuff then.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-01-07 18:18 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-05 17:09 [PATCH 0/6 v2] skb paged fragment destructors Ian Campbell
2012-01-05 17:13 ` [PATCH 1/6] net: pack skb_shared_info more efficiently Ian Campbell
2012-01-05 18:30   ` Eric Dumazet
2012-01-05 19:09     ` David Miller
2012-01-05 17:13 ` [PATCH 2/6] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
2012-01-05 17:46   ` Eric Dumazet
2012-01-05 19:13     ` David Miller
2012-01-05 20:00       ` Ian Campbell
2012-01-07 18:18         ` David Miller
2012-01-05 19:19     ` Ian Campbell
2012-01-05 20:22       ` Eric Dumazet
2012-01-06 11:20         ` Ian Campbell
2012-01-06 12:33           ` Eric Dumazet
2012-01-06 13:20             ` Ian Campbell
2012-01-06 13:37               ` Eric Dumazet
2012-01-06 13:49                 ` Ian Campbell
2012-01-06 13:29             ` Ian Campbell
2012-01-05 17:13 ` [PATCH 3/6] net: add support for per-paged-fragment destructors Ian Campbell
2012-01-06  1:15   ` Michał Mirosław
2012-01-06  8:50     ` Ian Campbell
2012-01-05 17:13 ` [PATCH 4/6] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
2012-01-05 17:13 ` [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage Ian Campbell
2012-01-05 19:15   ` David Miller
2012-01-05 20:04     ` Ian Campbell
     [not found] ` <1325783399.25206.413.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2012-01-05 17:13   ` [PATCH 6/6] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell
2012-01-05 17:27 ` [PATCH 0/6 v2] skb paged fragment destructors David Laight
2012-01-05 17:34   ` David Miller
