* [PATCH 0/4] skb paged fragment destructors
@ 2011-11-09 15:01 Ian Campbell
  2011-11-09 15:02 ` [PATCH 1/4] net: add support for per-paged-fragment destructors Ian Campbell
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 15:01 UTC (permalink / raw)
  To: David Miller; +Cc: Jesse Brandeburg, netdev

The following series makes use of the skb fragment API (which is in 3.2)
to add a per-paged-fragment destructor callback. This can be used by
creators of skbs who are interested in the lifecycle of the pages
included in those skbs after they have been handed off to the network
stack. I think these have all been posted before, but have been backed
up behind the skb fragment API.

The mail at [0] contains some more background and rationale, but
basically the completed series will allow entities which inject pages
into the networking stack to receive a notification when the stack has
really finished with those pages (i.e. including retransmissions,
clones, pull-ups etc) and not just when the original skb is finished
with. This is beneficial to the many subsystems which wish to inject
pages into the network stack without giving up full ownership of those
pages' lifecycle. It implements something broadly along the lines of
what was described in [1].

I have also included a patch to the RPC subsystem which uses this API to
fix the bug which I describe at [2].

I presented this work at LPC in September and a question/concern was
raised (by Jesse Brandeburg, IIRC) regarding the overhead of adding
this extra field per fragment. If I understand correctly, there have
been performance regressions in the past when allocations outgrew one
allocation size bucket and therefore spilled into the next. The change
in data structure sizes resulting from this series is:
					  BEFORE	AFTER
AMD64:	sizeof(struct skb_frag_struct)	= 16		24
	sizeof(struct skb_shared_info)	= 344		488
	sizeof(struct sk_buff)		= 240		240

i386:	sizeof(struct skb_frag_struct)	= 8		12
	sizeof(struct skb_shared_info)	= 188		260
	sizeof(struct sk_buff)		= 192		192

(I think these are representative of 32- and 64-bit arches generally)

On amd64 this doesn't in itself push shared_info over a slab boundary,
but since the linear data is part of the same allocation, the amount of
linear data which will push us into the next size is reduced from 168
to 24 bytes, which is effectively the same thing as pushing directly
into the next size. On i386 we go straight to the next bucket (although
the 68 bytes of slack available for the linear area becomes 252 bytes
in that larger size).
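
To spell out the arithmetic (assuming the usual power-of-two kmalloc
caches, and that the linear data and skb_shared_info share a single
allocation):

	amd64:	next bucket above 344 is 512 => 512 - 344 = 168 bytes of
		linear data fit today; 512 - 488 = 24 bytes afterwards.
	i386:	next bucket above 188 is 256 => 256 - 188 = 68 bytes of
		linear data fit today; 260 no longer fits in 256, so we
		move to 512 => 512 - 260 = 252 bytes of slack.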

I'm not sure if this is a showstopper, or whether that particular slab
issue still exists (or maybe it was specific to one of slab/slub/slob?).
I need to find a benchmark which might demonstrate the issue
(presumably something where frames commonly have 24 < size < 168).
Jesse, any hints on how to test this, or references to the previous
occurrence(s), would be gratefully received.

Possible solutions all seem a bit fugly:

      * suitably sized slab caches appropriate to these new sizes (no
        suitable number leaps out at me...)
      * split linear data allocation and shinfo allocation into two. I
        suspect this will have its own performance implications? On the
        positive side skb_shared_info could come from its own fixed size
        pool/cache which might have some benefits
      * steal a bit in the pointer to indicate destructor pointer vs
        regular struct page pointer (moving the struct page into the
        destructor data structure for that case; rough sketch after
        this list). Stops us sharing a single destructor between
        multiple pages, but that isn't critical
      * add a bitmap to shinfo indicating the kind of each frag.
        Requires modification of anywhere which propagates the page
        member (doable, but another huge series)
      * some sort of pointer compression scheme?
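
For the bit-stealing option, a very rough sketch of what I mean
(illustrative only; the struct and function names are made up, and it
assumes the frag would hold a single pointer-sized word whose low bit
is free because both targets are word aligned):

	/* Hypothetical wrapper used when a destructor is present; the
	 * struct page moves in here and the frag stores this pointer
	 * with the low bit set. */
	struct skb_frag_pdest {
		struct page *page;
		atomic_t ref;
		int (*destroy)(void *data);
		void *data;
	};

	#define SKB_FRAG_PDEST_BIT	0x1UL

	static inline struct page *skb_frag_page_of(unsigned long word)
	{
		if (word & SKB_FRAG_PDEST_BIT) {
			struct skb_frag_pdest *pd = (struct skb_frag_pdest *)
						(word & ~SKB_FRAG_PDEST_BIT);
			return pd->page;
		}
		return (struct page *)word;
	}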

I'm sure there are others I'm not seeing right now.

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2

* [PATCH 1/4] net: add support for per-paged-fragment destructors
  2011-11-09 15:01 [PATCH 0/4] skb paged fragment destructors Ian Campbell
@ 2011-11-09 15:02 ` Ian Campbell
  2011-11-09 15:33   ` Michał Mirosław
  2011-11-09 15:02 ` [PATCH 2/4] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 15:02 UTC (permalink / raw)
  To: netdev
  Cc: Ian Campbell, David S. Miller, Eric Dumazet, Michał Mirosław

Entities which care about the complete lifecycle of pages which they inject
into the network stack via an skb paged fragment can choose to set this
destructor in order to receive a callback when the stack is really finished
with a page (including all clones, retransmits, pull-ups, etc.).

This destructor will always be propagated alongside the struct page when
copying skb_frag_t->page. This is the reason I chose to embed the destructor in
a "struct { } page" within the skb_frag_t, rather than as a separate field,
since it allows existing code which propagates ->frags[N].page to Just
Work(tm).

When the destructor is present the page reference counting is done slightly
differently. No references are held by the network stack on the struct page
(it is up to the caller to manage this as necessary); instead the network
stack will track references via the count embedded in the destructor
structure. When this reference count reaches zero the destructor will be
called and the caller can take the necessary steps to release the page
(i.e. release the struct page reference itself).

The intention is that callers can use this callback to delay completion to
_their_ callers until the network stack has completely released the page, in
order to prevent use-after-free or modification of data pages which are still
in use by the stack.

It is allowable (indeed expected) for a caller to share a single destructor
instance between multiple pages injected into the stack, e.g. a group of
pages included in a single higher level operation might share a destructor
which is used to complete that higher level operation.
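
For illustration, a minimal sketch of the intended usage (my_op and
my_op_complete() are invented names; the rest is the API added below):

	struct my_op {				/* hypothetical caller state */
		struct skb_frag_destructor destroy;
		struct completion done;
	};

	static int my_op_complete(void *data)
	{
		struct my_op *op = data;

		/* The stack, including clones/retransmits, has released
		 * the pages. */
		complete(&op->done);
		return 0;
	}

	/* Before handing the pages to the stack: */
	atomic_set(&op->destroy.ref, 1);	/* caller's own reference */
	op->destroy.destroy = my_op_complete;
	op->destroy.data = op;

	/* For each page injected into an skb: */
	skb_fill_page_desc(skb, i, page, off, len);
	skb_frag_set_destructor(skb, i, &op->destroy);
	skb_frag_ref(skb, i);		/* counted on destroy.ref, not the page */

	/* Once everything is submitted, drop the caller's reference; when
	 * the stack's references are also gone my_op_complete() fires. */
	skb_frag_destructor_unref(&op->destroy);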

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/skbuff.c      |   17 +++++++++++++++++
 2 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6a6b352..5bdb74d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -139,9 +139,16 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(void *data);
+	void *data;
+};
+
 struct skb_frag_struct {
 	struct {
 		struct page *p;
+		struct skb_frag_destructor *destructor;
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
@@ -1160,6 +1167,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 }
 
 /**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+					   struct skb_frag_destructor *destroy)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+	frag->page.destructor = destroy;
+}
+
+/**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
@@ -1178,6 +1210,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page.p		  = page;
+	frag->page.destructor     = NULL;
 	frag->page_offset	  = off;
 	skb_frag_size_set(frag, size);
 }
@@ -1704,6 +1737,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
 	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
  * @frag: the paged fragment
@@ -1712,6 +1748,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_ref(frag->page.destructor);
+		return;
+	}
 	get_page(skb_frag_page(frag));
 }
 
@@ -1735,6 +1775,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
  */
 static inline void __skb_frag_unref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_unref(frag->page.destructor);
+		return;
+	}
 	put_page(skb_frag_page(frag));
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ca4db40..1c5cc05 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -302,6 +302,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
 }
 EXPORT_SYMBOL(dev_alloc_skb);
 
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+	BUG_ON(destroy == NULL);
+	atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+	if (destroy == NULL)
+		return;
+
+	if (atomic_dec_and_test(&destroy->ref))
+		destroy->destroy(destroy->data);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
-- 
1.7.2.5

* [PATCH 2/4] net: only allow paged fragments with the same destructor to be coalesced.
  2011-11-09 15:01 [PATCH 0/4] skb paged fragment destructors Ian Campbell
  2011-11-09 15:02 ` [PATCH 1/4] net: add support for per-paged-fragment destructors Ian Campbell
@ 2011-11-09 15:02 ` Ian Campbell
       [not found] ` <1320850895.955.172.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 15:02 UTC (permalink / raw)
  To: netdev
  Cc: Ian Campbell, David S. Miller, Alexey Kuznetsov,
	Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Eric Dumazet,
	Michał Mirosław

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |    7 +++++--
 net/core/skbuff.c      |    1 +
 net/ipv4/ip_output.c   |    2 +-
 net/ipv4/tcp.c         |    4 ++--
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5bdb74d..3ba6f58 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1970,13 +1970,16 @@ static inline int skb_add_data(struct sk_buff *skb,
 }
 
 static inline int skb_can_coalesce(struct sk_buff *skb, int i,
-				   const struct page *page, int off)
+				   const struct page *page,
+				   const struct skb_frag_destructor *destroy,
+				   int off)
 {
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
 		return page == skb_frag_page(frag) &&
-		       off == frag->page_offset + skb_frag_size(frag);
+		       off == frag->page_offset + skb_frag_size(frag) &&
+		       frag->page.destructor == destroy;
 	}
 	return 0;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1c5cc05..c510740 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2275,6 +2275,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 */
 	if (!to ||
 	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+			      fragfrom->page.destructor,
 			      fragfrom->page_offset)) {
 		merge = -1;
 	} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0bc95f3..3252e06 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1229,7 +1229,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, offset)) {
+		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
 			get_page(page);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34f5db1..018de0c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -804,7 +804,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -1008,7 +1008,7 @@ new_segment:
 				struct page *page = TCP_PAGE(sk);
 				int off = TCP_OFF(sk);
 
-				if (skb_can_coalesce(skb, i, page, off) &&
+				if (skb_can_coalesce(skb, i, page, NULL, off) &&
 				    off != PAGE_SIZE) {
 					/* We can extend the last page
 					 * fragment. */
-- 
1.7.2.5

* [PATCH 3/4] net: add paged frag destructor support to kernel_sendpage.
@ 2011-11-09 15:02     ` Ian Campbell
  0 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 15:02 UTC (permalink / raw)
  To: netdev
  Cc: Ian Campbell, David S. Miller, Alexey Kuznetsov,
	Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Trond Myklebust, Greg Kroah-Hartman, drbd-user, devel,
	cluster-devel, ocfs2-devel, ceph-devel, rds-devel, linux-nfs

This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: drbd-user@lists.linbit.com
Cc: devel@driverdev.osuosl.org
Cc: cluster-devel@redhat.com
Cc: ocfs2-devel@oss.oracle.com
Cc: netdev@vger.kernel.org
Cc: ceph-devel@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-nfs@vger.kernel.org
---
 drivers/block/drbd/drbd_main.c           |    1 +
 drivers/scsi/iscsi_tcp.c                 |    4 ++--
 drivers/scsi/iscsi_tcp.h                 |    3 ++-
 drivers/staging/pohmelfs/trans.c         |    3 ++-
 drivers/target/iscsi/iscsi_target_util.c |    3 ++-
 fs/dlm/lowcomms.c                        |    4 ++--
 fs/ocfs2/cluster/tcp.c                   |    1 +
 include/linux/net.h                      |    6 +++++-
 include/net/inet_common.h                |    4 +++-
 include/net/ip.h                         |    4 +++-
 include/net/sock.h                       |    8 +++++---
 include/net/tcp.h                        |    4 +++-
 net/ceph/messenger.c                     |    2 +-
 net/core/sock.c                          |    6 +++++-
 net/ipv4/af_inet.c                       |    9 ++++++---
 net/ipv4/ip_output.c                     |    6 ++++--
 net/ipv4/tcp.c                           |   24 ++++++++++++++++--------
 net/ipv4/udp.c                           |   11 ++++++-----
 net/ipv4/udp_impl.h                      |    5 +++--
 net/rds/tcp_send.c                       |    1 +
 net/socket.c                             |   11 +++++++----
 net/sunrpc/svcsock.c                     |    6 +++---
 net/sunrpc/xprtsock.c                    |    2 +-
 23 files changed, 84 insertions(+), 44 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 0358e55..49c7346 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
 	set_fs(KERNEL_DS);
 	do {
 		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+							NULL,
 							offset, len,
 							msg_flags);
 		if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 7724414..2eb6801 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -283,8 +283,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
 		if (!segment->data) {
 			sg = segment->sg;
 			offset += segment->sg_offset + sg->offset;
-			r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
-						  copy, flags);
+			r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+						  offset, copy, flags);
 		} else {
 			struct msghdr msg = { .msg_flags = flags };
 			struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
 	uint32_t		sendpage_failures_cnt;
 	uint32_t		discontiguous_hdr_cnt;
 
-	ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+	ssize_t (*sendpage)(struct socket *, struct page *,
+			    struct skb_frag_destructor *, int, size_t, int);
 };
 
 struct iscsi_sw_tcp_host {
diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index 36a2535..f897fdb 100644
--- a/drivers/staging/pohmelfs/trans.c
+++ b/drivers/staging/pohmelfs/trans.c
@@ -104,7 +104,8 @@ static int netfs_trans_send_pages(struct netfs_trans *t, struct netfs_state *st)
 		msg.msg_flags = MSG_WAITALL | (attached_pages == 1 ? 0 :
 				MSG_MORE);
 
-		err = kernel_sendpage(st->socket, page, 0, size, msg.msg_flags);
+		err = kernel_sendpage(st->socket, page, NULL,
+				      0, size, msg.msg_flags);
 		if (err <= 0) {
 			printk("%s: %d/%d failed to send transaction page: t: %p, gen: %u, size: %u, err: %d.\n",
 					__func__, i, t->page_num, t, t->gen, size, err);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index f00137f..1ce98e9 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1297,7 +1297,8 @@ send_hdr:
 		u32 sub_len = min_t(u32, data_len, space);
 send_pg:
 		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+					sg_page(sg), NULL,
+					sg->offset + offset, sub_len, 0);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tcp_sendpage() returned"
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 990626e..98ace05 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1342,8 +1342,8 @@ static void send_to_sock(struct connection *con)
 
 		ret = 0;
 		if (len) {
-			ret = kernel_sendpage(con->sock, e->page, offset, len,
-					      msg_flags);
+			ret = kernel_sendpage(con->sock, e->page, NULL,
+					      offset, len, msg_flags);
 			if (ret == -EAGAIN || ret == 0) {
 				if (ret == -EAGAIN &&
 				    test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) &&
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index db5ee4b..81366a0 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -982,6 +982,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		mutex_lock(&sc->sc_send_lock);
 		ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
 						 virt_to_page(kmalloced_virt),
+						 NULL,
 						 (long)kmalloced_virt & ~PAGE_MASK,
 						 size, MSG_DONTWAIT);
 		mutex_unlock(&sc->sc_send_lock);
diff --git a/include/linux/net.h b/include/linux/net.h
index b299230..db562ba 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct skb_frag_destructor;
 
 struct proto_ops {
 	int		family;
@@ -203,6 +204,7 @@ struct proto_ops {
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
+				      struct skb_frag_destructor *destroy,
 				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
@@ -273,7 +275,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
 			     char *optval, int *optlen);
 extern int kernel_setsockopt(struct socket *sock, int level, int optname,
 			     char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+			   struct skb_frag_destructor *destroy,
+			   int offset,
 			   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..91cd8d0 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
 extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size);
-extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+extern ssize_t inet_sendpage(struct socket *sock, struct page *page,
+			     struct skb_frag_destructor *frag,
+			     int offset,
 			     size_t size, int flags);
 extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size, int flags);
diff --git a/include/net/ip.h b/include/net/ip.h
index eca0ef7..d34030c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -114,7 +114,9 @@ extern int		ip_append_data(struct sock *sk, struct flowi4 *fl4,
 				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
-extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4,
+				struct page *page,
+				struct skb_frag_destructor *destroy,
 				int offset, size_t size, int flags);
 extern struct sk_buff  *__ip_make_skb(struct sock *sk,
 				      struct flowi4 *fl4,
diff --git a/include/net/sock.h b/include/net/sock.h
index 5ac682f..95ebafd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -774,6 +774,7 @@ struct proto {
 					size_t len, int noblock, int flags, 
 					int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
+					struct skb_frag_destructor *destroy,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
 					struct sockaddr *uaddr, int addr_len);
@@ -1162,9 +1163,10 @@ extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
 extern ssize_t			sock_no_sendpage(struct socket *sock,
-						struct page *page,
-						int offset, size_t size, 
-						int flags);
+					struct page *page,
+					struct skb_frag_destructor *destroy,
+					int offset, size_t size,
+					int flags);
 
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e147f42..99fe8f3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -322,7 +322,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk);
 extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t size);
-extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+extern int tcp_sendpage(struct sock *sk, struct page *page,
+			struct skb_frag_destructor *destroy,
+			int offset,
 			size_t size, int flags);
 extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9918e9e..65b6fc0 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -849,7 +849,7 @@ static int write_partial_msg_pages(struct ceph_connection *con)
 				cpu_to_le32(crc32c(tmpcrc, base, len));
 			con->out_msg_pos.did_page_crc = 1;
 		}
-		ret = kernel_sendpage(con->sock, page,
+		ret = kernel_sendpage(con->sock, page, NULL,
 				      con->out_msg_pos.page_pos + page_shift,
 				      len,
 				      MSG_DONTWAIT | MSG_NOSIGNAL |
diff --git a/net/core/sock.c b/net/core/sock.c
index 5a08762..8fef547 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1860,7 +1860,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *
 }
 EXPORT_SYMBOL(sock_no_mmap);
 
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
+ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
+			 struct skb_frag_destructor *destroy,
+			 int offset, size_t size, int flags)
 {
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
@@ -1870,6 +1872,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
 	iov.iov_len = size;
 	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
 	kunmap(page);
+	/* kernel_sendmsg copies so we can destroy immediately */
+	skb_frag_destructor_unref(destroy);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1b5096a..99f7fd0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -745,7 +745,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage(struct socket *sock, struct page *page,
+		      struct skb_frag_destructor *destroy,
+		      int offset,
 		      size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
@@ -758,8 +760,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 		return -EAGAIN;
 
 	if (sk->sk_prot->sendpage)
-		return sk->sk_prot->sendpage(sk, page, offset, size, flags);
-	return sock_no_sendpage(sock, page, offset, size, flags);
+		return sk->sk_prot->sendpage(sk, page, destroy,
+					     offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(inet_sendpage);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3252e06..753dc7b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1116,6 +1116,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 }
 
 ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+		       struct skb_frag_destructor *destroy,
 		       int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -1229,11 +1230,12 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
+		if (skb_can_coalesce(skb, i, page, destroy, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, len);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		} else {
 			err = -EMSGSIZE;
 			goto error;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 018de0c..56ef323 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -757,7 +757,10 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
+static ssize_t do_tcp_sendpages(struct sock *sk,
+				struct page **pages,
+				struct skb_frag_destructor **destructors,
+				int poffset,
 			 size_t psize, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -783,6 +786,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	while (psize > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
 		struct page *page = pages[poffset / PAGE_SIZE];
+		struct skb_frag_destructor *destroy =
+			destructors ? destructors[poffset / PAGE_SIZE] : NULL;
 		int copy, i, can_coalesce;
 		int offset = poffset % PAGE_SIZE;
 		int size = min_t(size_t, psize, PAGE_SIZE - offset);
@@ -804,7 +809,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -815,8 +820,9 @@ new_segment:
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 		} else {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		}
 
 		skb->len += copy;
@@ -871,18 +877,20 @@ out_err:
 	return sk_stream_error(sk, flags, err);
 }
 
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int tcp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	ssize_t res;
 
 	if (!(sk->sk_route_caps & NETIF_F_SG) ||
 	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-		return sock_no_sendpage(sk->sk_socket, page, offset, size,
-					flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, &page, &destroy,
+			       offset, size, flags);
 	release_sock(sk);
 	return res;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ebaa96b..b653015 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1028,8 +1028,9 @@ do_confirm:
 }
 EXPORT_SYMBOL(udp_sendmsg);
 
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int udp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct udp_sock *up = udp_sk(sk);
@@ -1057,11 +1058,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	}
 
 	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
+			     page, destroy, offset, size, flags);
 	if (ret == -EOPNOTSUPP) {
 		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 	}
 	if (ret < 0) {
 		udp_flush_pending_frames(sk);
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index aaad650..4923d82 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,8 +23,9 @@ extern int	compat_udp_getsockopt(struct sock *sk, int level, int optname,
 #endif
 extern int	udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    size_t len, int noblock, int flags, int *addr_len);
-extern int	udp_sendpage(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags);
+extern int	udp_sendpage(struct sock *sk, struct page *page,
+			     struct skb_frag_destructor *destroy,
+			     int offset, size_t size, int flags);
 extern int	udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb);
 extern void	udp_destroy_sock(struct sock *sk);
 
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 1b4fd68..71503ad 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	while (sg < rm->data.op_nents) {
 		ret = tc->t_sock->ops->sendpage(tc->t_sock,
 						sg_page(&rm->data.op_sg[sg]),
+						NULL,
 						rm->data.op_sg[sg].offset + off,
 						rm->data.op_sg[sg].length - off,
 						MSG_DONTWAIT|MSG_NOSIGNAL);
diff --git a/net/socket.c b/net/socket.c
index 2877647..cbd5728 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -795,7 +795,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	if (more)
 		flags |= MSG_MORE;
 
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return kernel_sendpage(sock, page, NULL, offset, size, flags);
 }
 
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3352,15 +3352,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(kernel_setsockopt);
 
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+		    struct skb_frag_destructor *destroy,
+		    int offset,
 		    size_t size, int flags)
 {
 	sock_update_classid(sock->sk);
 
 	if (sock->ops->sendpage)
-		return sock->ops->sendpage(sock, page, offset, size, flags);
+		return sock->ops->sendpage(sock, page, destroy,
+					   offset, size, flags);
 
-	return sock_no_sendpage(sock, page, offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(kernel_sendpage);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 767d494..852a258 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -183,7 +183,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	/* send head */
 	if (slen == xdr->head[0].iov_len)
 		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
+	len = kernel_sendpage(sock, headpage, NULL, headoffset,
 				  xdr->head[0].iov_len, flags);
 	if (len != xdr->head[0].iov_len)
 		goto out;
@@ -196,7 +196,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
@@ -210,7 +210,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 
 	/* send tail */
 	if (xdr->tail[0].iov_len) {
-		result = kernel_sendpage(sock, tailpage, tailoffset,
+		result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
 				   xdr->tail[0].iov_len, 0);
 		if (result > 0)
 			len += result;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d7f97ef..f79e40e9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
@ 2011-11-09 15:02     ` Ian Campbell
  0 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 15:02 UTC (permalink / raw)
  To: netdev
  Cc: Ian Campbell, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    3 ++-
 net/sunrpc/xprt.c           |   13 +++++++++++++
 net/sunrpc/xprtsock.c       |    3 ++-
 6 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index a20970e..172f81e 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 15518a1..75131eb 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index c5347d2..919538d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1113,6 +1114,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1276,6 +1279,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1309,8 +1313,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1383,7 +1387,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1423,7 +1428,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
@@ -1589,12 +1593,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1609,6 +1615,17 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	dprintk("RPC: %5u call_complete result %d\n",
+		task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 852a258..3685cad 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -196,7 +196,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor,
+					 base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index f4385e4..925aa0c 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1103,6 +1103,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(void *calldata)
+{
+	struct rpc_task *task = calldata;
+	struct rpc_rqst	*req = task->tk_rqstp;
+
+	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1115,6 +1125,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
+	req->destructor.data = task;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index f79e40e9..af3a106 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+					  base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] net: add support for per-paged-fragment destructors
  2011-11-09 15:02 ` [PATCH 1/4] net: add support for per-paged-fragment destructors Ian Campbell
@ 2011-11-09 15:33   ` Michał Mirosław
  2011-11-09 16:25     ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: Michał Mirosław @ 2011-11-09 15:33 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David S. Miller, Eric Dumazet

On Wed, Nov 09, 2011 at 03:02:04PM +0000, Ian Campbell wrote:
> Entities which care about the complete lifecycle of pages which they inject
> into the network stack via an skb paged fragment can choose to set this
> destructor in order to receive a callback when the stack is really finished
> with a page (including all clones, retransmits, pull-ups etc etc).
[...]
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -139,9 +139,16 @@ struct sk_buff;
>  
>  typedef struct skb_frag_struct skb_frag_t;
>  
> +struct skb_frag_destructor {
> +	atomic_t ref;
> +	int (*destroy)(void *data);
> +	void *data;
> +};
> +
>  struct skb_frag_struct {
>  	struct {
>  		struct page *p;
> +		struct skb_frag_destructor *destructor;
>  	} page;

You can get rid of the data field of skb_frag_destructor: if destroy() gets
a pointer to the destroyed struct skb_frag_destructor, its users can
get at the containing struct via container_of() if needed, and the memory
pointed to by data won't have to be managed separately.
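
Something along these lines (just a sketch, not from this series; the
example_request structure and example_destroy() callback are made up to
illustrate the container_of() usage):

	struct example_request {
		/* ... whatever state the caller needs ... */
		struct skb_frag_destructor destructor;
	};

	/* destroy() now takes the destructor itself rather than a data
	 * pointer; the caller recovers its own structure with
	 * container_of() */
	static int example_destroy(struct skb_frag_destructor *destroy)
	{
		struct example_request *req =
			container_of(destroy, struct example_request, destructor);

		/* complete/release req here */
		return 0;
	}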

Nice work, BTW!

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] net: add support for per-paged-fragment destructors
  2011-11-09 15:33   ` Michał Mirosław
@ 2011-11-09 16:25     ` Ian Campbell
  2011-11-09 17:24       ` Michał Mirosław
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 16:25 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: netdev, David S. Miller, Eric Dumazet

On Wed, 2011-11-09 at 15:33 +0000, Michał Mirosław wrote:
> On Wed, Nov 09, 2011 at 03:02:04PM +0000, Ian Campbell wrote:
> > Entities which care about the complete lifecycle of pages which they inject
> > into the network stack via an skb paged fragment can choose to set this
> > destructor in order to receive a callback when the stack is really finished
> > with a page (including all clones, retransmits, pull-ups etc etc).
> [...]
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -139,9 +139,16 @@ struct sk_buff;
> >  
> >  typedef struct skb_frag_struct skb_frag_t;
> >  
> > +struct skb_frag_destructor {
> > +	atomic_t ref;
> > +	int (*destroy)(void *data);
> > +	void *data;
> > +};
> > +
> >  struct skb_frag_struct {
> >  	struct {
> >  		struct page *p;
> > +		struct skb_frag_destructor *destructor;
> >  	} page;
> 
> You can get rid of the data field of skb_frag_destructor: if destroy() gets
> pointer to the destroyed struct skb_frag_set_destructor, its users can
> get at containing struct via container_of() if needed and the memory
> pointed to by data won't have to be managed separately.

At the moment you can share one destructor between all the frags,
whereas data is specific to the frag. Currently the only user is the
rpc patch, which doesn't make use of this capability (i.e. it could use
container_of), but I'm not sure about other potential users yet.

> Nice work, BTW!

Thanks.

> 
> Best Regards,
> Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] net: add support for per-paged-fragment destructors
  2011-11-09 16:25     ` Ian Campbell
@ 2011-11-09 17:24       ` Michał Mirosław
  2011-11-09 17:28         ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: Michał Mirosław @ 2011-11-09 17:24 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David S. Miller, Eric Dumazet

On Wed, Nov 09, 2011 at 04:25:14PM +0000, Ian Campbell wrote:
> On Wed, 2011-11-09 at 15:33 +0000, Michał Mirosław wrote:
> > On Wed, Nov 09, 2011 at 03:02:04PM +0000, Ian Campbell wrote:
> > > Entities which care about the complete lifecycle of pages which they inject
> > > into the network stack via an skb paged fragment can choose to set this
> > > destructor in order to receive a callback when the stack is really finished
> > > with a page (including all clones, retransmits, pull-ups etc etc).
> > [...]
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -139,9 +139,16 @@ struct sk_buff;
> > >  
> > >  typedef struct skb_frag_struct skb_frag_t;
> > >  
> > > +struct skb_frag_destructor {
> > > +	atomic_t ref;
> > > +	int (*destroy)(void *data);
> > > +	void *data;
> > > +};
> > > +
> > >  struct skb_frag_struct {
> > >  	struct {
> > >  		struct page *p;
> > > +		struct skb_frag_destructor *destructor;
> > >  	} page;
> > 
> > You can get rid of the data field of skb_frag_destructor: if destroy() gets
> > pointer to the destroyed struct skb_frag_set_destructor, its users can
> > get at containing struct via container_of() if needed and the memory
> > pointed to by data won't have to be managed separately.
> At the moment you can share one destructor between all the frags,
> whereas data is specific to the frag.
[...]

If you want distinct data pointers then you also need a per-frag
skb_frag_destructor, as you wrote in this patch. So removing the 'data' field
saves memory but doesn't change anything else except the way to reference
the data (container_of() instead of a pointer dereference).

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] net: add support for per-paged-fragment destructors
  2011-11-09 17:24       ` Michał Mirosław
@ 2011-11-09 17:28         ` Ian Campbell
  0 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-09 17:28 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: netdev, David S. Miller, Eric Dumazet

On Wed, 2011-11-09 at 17:24 +0000, Michał Mirosław wrote:
> On Wed, Nov 09, 2011 at 04:25:14PM +0000, Ian Campbell wrote:
> > On Wed, 2011-11-09 at 15:33 +0000, Michał Mirosław wrote:
> > > On Wed, Nov 09, 2011 at 03:02:04PM +0000, Ian Campbell wrote:
> > > > Entities which care about the complete lifecycle of pages which they inject
> > > > into the network stack via an skb paged fragment can choose to set this
> > > > destructor in order to receive a callback when the stack is really finished
> > > > with a page (including all clones, retransmits, pull-ups etc etc).
> > > [...]
> > > > --- a/include/linux/skbuff.h
> > > > +++ b/include/linux/skbuff.h
> > > > @@ -139,9 +139,16 @@ struct sk_buff;
> > > >  
> > > >  typedef struct skb_frag_struct skb_frag_t;
> > > >  
> > > > +struct skb_frag_destructor {
> > > > +	atomic_t ref;
> > > > +	int (*destroy)(void *data);
> > > > +	void *data;
> > > > +};
> > > > +
> > > >  struct skb_frag_struct {
> > > >  	struct {
> > > >  		struct page *p;
> > > > +		struct skb_frag_destructor *destructor;
> > > >  	} page;
> > > 
> > > You can get rid of the data field of skb_frag_destructor: if destroy() gets
> > > pointer to the destroyed struct skb_frag_set_destructor, its users can
> > > get at containing struct via container_of() if needed and the memory
> > > pointed to by data won't have to be managed separately.
> > At the moment you can share one destructor between all the frags,
> > whereas data is specific to the frag.
> [...]
> 
> If you want distinct data pointers then you need to also have per-frag
> skb_frag_destructor as you wrote in this patch. So removing 'data' field
> saves memory but doesn't change anything else except the way to reference
> the data (container_of() instead of pointer dereference).

Oh yes, you are absolutely right.

Ian.

> 
> Best Regards,
> Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-11-09 15:01 [PATCH 0/4] skb paged fragment destructors Ian Campbell
                   ` (2 preceding siblings ...)
       [not found] ` <1320850895.955.172.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2011-11-09 17:49 ` Eric Dumazet
  2011-11-10 10:39   ` Ian Campbell
  2011-11-17 14:45   ` Ian Campbell
  2011-12-06 11:57 ` Ian Campbell
  4 siblings, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2011-11-09 17:49 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, Jesse Brandeburg, netdev

On Wednesday 09 November 2011 at 15:01 +0000, Ian Campbell wrote:
> The following series makes use of the skb fragment API (which is in 3.2)
> to add a per-paged-fragment destructor callback. This can be used by
> creators of skbs who are interested in the lifecycle of the pages
> included in that skb after they have handed it off to the network stack.
> I think these have all been posted before, but have been backed up
> behind the skb fragment API.
> 
> The mail at [0] contains some more background and rationale but
> basically the completed series will allow entities which inject pages
> into the networking stack to receive a notification when the stack has
> really finished with those pages (i.e. including retransmissions,
> clones, pull-ups etc) and not just when the original skb is finished
> with, which is beneficial to many subsystems which wish to inject pages
> into the network stack without giving up full ownership of those page's
> lifecycle. It implements something broadly along the lines of what was
> described in [1].
> 
> I have also included a patch to the RPC subsystem which uses this API to
> fix the bug which I describe at [2].
> 
> I presented this work at LPC in September and there was a
> question/concern raised (by Jesse Brandenburg IIRC) regarding the
> overhead of adding this extra field per fragment. If I understand
> correctly it seems that in the there have been performance regressions
> in the past with allocations outgrowing one allocation size bucket and
> therefore using the next. The change in datastructure size resulting
> from this series is:
> 					  BEFORE	AFTER
> AMD64:	sizeof(struct skb_frag_struct)	= 16		24
> 	sizeof(struct skb_shared_info)	= 344		488

That's a real problem, because 488 is so big (it's even rounded to 512
bytes).

Now, on x86, a half page (2048 bytes) won't be big enough to contain a
typical frame (MTU=1500):

NET_SKB_PAD (64) + 1500 + 14 + 512 > 2048

Even if we don't round 488 up to 512 (i.e. don't cache-align
skb_shared_info) we still have a problem:

NET_SKB_PAD (64) + 1500 + 14 + 488 > 2048

Why not use a low-order bit to mark 'page' as being a pointer to

struct skb_frag_page_desc {
	struct page *p;
	atomic_t ref;
	int (*destroy)(void *data);
/*	void *data; */ /* no need, see container_of() */
};

struct skb_frag_struct {
        struct {
                union {
			struct page *p; /* low order bit not set */
			struct skb_frag_page_desc *skbpage; /* low order bit set */
		};
        } page;
...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/4] net: add paged frag destructor support to kernel_sendpage.
  2011-11-09 15:02     ` Ian Campbell
  (?)
@ 2011-11-09 18:02       ` Michał Mirosław
  -1 siblings, 0 replies; 52+ messages in thread
From: Michał Mirosław @ 2011-11-09 18:02 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Trond Myklebust, Greg Kroah-Hartman, drbd-user, devel,
	cluster-devel, ocfs2-devel, ceph-devel, rds-devel, linux-nfs

2011/11/9 Ian Campbell <ian.campbell@citrix.com>:
> This requires adding a new argument to various sendpage hooks up and down the
> stack. At the moment this parameter is always NULL.
[...]
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -203,6 +204,7 @@ struct proto_ops {
>        ssize_t         (*sendpage)  (struct socket *sock, struct page *page,
> +                                     struct skb_frag_destructor *destroy,
>                                      int offset, size_t size, int flags);

Maybe you could instead add a new op like sendfrag() that would get an
already-prepared skb_frag_struct? In the end all page data ends up
described in skb_frag_struct, so this would reduce copying this
information all over the network stack. This might be a bigger change,
though.
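
Roughly something like this (an illustrative signature only, nothing I
have prototyped):

	ssize_t (*sendfrag)(struct socket *sock, const skb_frag_t *frag,
			    int flags);

with the caller describing the page, offset, size and (optionally) the
destructor in the skb_frag_struct before handing it down.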

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-11-09 17:49 ` [PATCH 0/4] skb paged fragment destructors Eric Dumazet
@ 2011-11-10 10:39   ` Ian Campbell
  2011-11-17 14:45   ` Ian Campbell
  1 sibling, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-10 10:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jesse Brandeburg, netdev

On Wed, 2011-11-09 at 17:49 +0000, Eric Dumazet wrote:
> On Wednesday 09 November 2011 at 15:01 +0000, Ian Campbell wrote:
> > The following series makes use of the skb fragment API (which is in 3.2)
> > to add a per-paged-fragment destructor callback. This can be used by
> > creators of skbs who are interested in the lifecycle of the pages
> > included in that skb after they have handed it off to the network stack.
> > I think these have all been posted before, but have been backed up
> > behind the skb fragment API.
> > 
> > The mail at [0] contains some more background and rationale but
> > basically the completed series will allow entities which inject pages
> > into the networking stack to receive a notification when the stack has
> > really finished with those pages (i.e. including retransmissions,
> > clones, pull-ups etc) and not just when the original skb is finished
> > with, which is beneficial to many subsystems which wish to inject pages
> > into the network stack without giving up full ownership of those page's
> > lifecycle. It implements something broadly along the lines of what was
> > described in [1].
> > 
> > I have also included a patch to the RPC subsystem which uses this API to
> > fix the bug which I describe at [2].
> > 
> > I presented this work at LPC in September and there was a
> > question/concern raised (by Jesse Brandenburg IIRC) regarding the
> > overhead of adding this extra field per fragment. If I understand
> > correctly it seems that in the there have been performance regressions
> > in the past with allocations outgrowing one allocation size bucket and
> > therefore using the next. The change in datastructure size resulting
> > from this series is:
> > 					  BEFORE	AFTER
> > AMD64:	sizeof(struct skb_frag_struct)	= 16		24
> > 	sizeof(struct skb_shared_info)	= 344		488
> 
> Thats a real problem, because 488 is soo big. (its even rounded to 512
> bytes)
> 
> Now, on x86, a half page (2048 bytes) wont be big enough to contain a
> typical frame (MTU=1500)
> 
> NET_SKB_PAD (64) + 1500 + 14 + 512 > 2048
> 
> 
> Even if we dont round 488 to 512, (no cache align skb_shared_info) we
> have a problem.
> 
> NET_SKB_PAD (64) + 1500 + 14 + 488 > 2048

Thanks Eric, that makes perfect sense. I doubt we can find a way to save
the necessary 18 bytes (or more depending on how much NET_SKB_PAD adds)
to make that > into a <= so I'll need to find another way.

> Why not using a low order bit to mark 'page' being a pointer to 

Yes, that was what I meant by "steal a bit a pointer" (leaving aside my
mangled English there...). I think it's probably the best of the
options, I'll code it up.
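
Presumably the read side would then end up looking something like the
following (completely untested sketch; skb_frag_has_destructor() and
skb_frag_desc() are made-up names, and it assumes the pointers are at
least 2-byte aligned so the low bit is free for the tag):

	static inline bool skb_frag_has_destructor(const skb_frag_t *frag)
	{
		return (unsigned long)frag->page.p & 1UL;
	}

	static inline struct skb_frag_page_desc *
	skb_frag_desc(const skb_frag_t *frag)
	{
		return (struct skb_frag_page_desc *)
			((unsigned long)frag->page.skbpage & ~1UL);
	}

	static inline struct page *skb_frag_page(const skb_frag_t *frag)
	{
		if (skb_frag_has_destructor(frag))
			return skb_frag_desc(frag)->p;
		return frag->page.p;
	}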

Ian.

> 
> struct skb_frag_page_desc {
> 	struct page *p;
> 	atomic_t ref;
> 	int (*destroy)(void *data);
> /*	void *data; */ /* no need, see container_of() */
> };
> 
> struct skb_frag_struct {
>         struct {
>                 union {
> 			struct page *p; /* low order bit not set */
> 			struct skb_frag_page_desc *skbpage; /* low order bit set */
> 		};
>         } page;
> ...
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
@ 2011-11-11 12:38         ` Michael S. Tsirkin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2011-11-11 12:38 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs

On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> This prevents an issue where an ACK is delayed, a retransmit is queued (either
> at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> wire. If this happens to an NFS WRITE RPC then the write() system call
> completes and the userspace process can continue, potentially modifying data
> referenced by the retransmission before the retransmission occurs.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Neil Brown <neilb@suse.de>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Cc: linux-nfs@vger.kernel.org
> Cc: netdev@vger.kernel.org

So this blocks the system call until all page references
are gone, right?
But there's no upper limit on how long the
page is referenced, correct? Consider a bridged setup
with an skb queued at a tap device - could this cause one process
to block another one by virtue of not consuming a cloned skb?

> ---
>  include/linux/sunrpc/xdr.h  |    2 ++
>  include/linux/sunrpc/xprt.h |    5 ++++-
>  net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
>  net/sunrpc/svcsock.c        |    3 ++-
>  net/sunrpc/xprt.c           |   13 +++++++++++++
>  net/sunrpc/xprtsock.c       |    3 ++-
>  6 files changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
> index a20970e..172f81e 100644
> --- a/include/linux/sunrpc/xdr.h
> +++ b/include/linux/sunrpc/xdr.h
> @@ -16,6 +16,7 @@
>  #include <asm/byteorder.h>
>  #include <asm/unaligned.h>
>  #include <linux/scatterlist.h>
> +#include <linux/skbuff.h>
>  
>  /*
>   * Buffer adjustment
> @@ -57,6 +58,7 @@ struct xdr_buf {
>  			tail[1];	/* Appended after page data */
>  
>  	struct page **	pages;		/* Array of contiguous pages */
> +	struct skb_frag_destructor *destructor;
>  	unsigned int	page_base,	/* Start of page data */
>  			page_len,	/* Length of page data */
>  			flags;		/* Flags for data disposition */
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 15518a1..75131eb 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -92,7 +92,10 @@ struct rpc_rqst {
>  						/* A cookie used to track the
>  						   state of the transport
>  						   connection */
> -	
> +	struct skb_frag_destructor destructor;	/* SKB paged fragment
> +						 * destructor for
> +						 * transmitted pages*/
> +
>  	/*
>  	 * Partial send handling
>  	 */
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index c5347d2..919538d 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
>  static void	call_reserveresult(struct rpc_task *task);
>  static void	call_allocate(struct rpc_task *task);
>  static void	call_decode(struct rpc_task *task);
> +static void	call_complete(struct rpc_task *task);
>  static void	call_bind(struct rpc_task *task);
>  static void	call_bind_status(struct rpc_task *task);
>  static void	call_transmit(struct rpc_task *task);
> @@ -1113,6 +1114,8 @@ rpc_xdr_encode(struct rpc_task *task)
>  			 (char *)req->rq_buffer + req->rq_callsize,
>  			 req->rq_rcvsize);
>  
> +	req->rq_snd_buf.destructor = &req->destructor;
> +
>  	p = rpc_encode_header(task);
>  	if (p == NULL) {
>  		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
> @@ -1276,6 +1279,7 @@ call_connect_status(struct rpc_task *task)
>  static void
>  call_transmit(struct rpc_task *task)
>  {
> +	struct rpc_rqst *req = task->tk_rqstp;
>  	dprint_status(task);
>  
>  	task->tk_action = call_status;
> @@ -1309,8 +1313,8 @@ call_transmit(struct rpc_task *task)
>  	call_transmit_status(task);
>  	if (rpc_reply_expected(task))
>  		return;
> -	task->tk_action = rpc_exit_task;
> -	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
> +	task->tk_action = call_complete;
> +	skb_frag_destructor_unref(&req->destructor);
>  }
>  
>  /*
> @@ -1383,7 +1387,8 @@ call_bc_transmit(struct rpc_task *task)
>  		return;
>  	}
>  
> -	task->tk_action = rpc_exit_task;
> +	task->tk_action = call_complete;
> +	skb_frag_destructor_unref(&req->destructor);
>  	if (task->tk_status < 0) {
>  		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
>  			"error: %d\n", task->tk_status);
> @@ -1423,7 +1428,6 @@ call_bc_transmit(struct rpc_task *task)
>  			"error: %d\n", task->tk_status);
>  		break;
>  	}
> -	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
>  }
>  #endif /* CONFIG_SUNRPC_BACKCHANNEL */
>  
> @@ -1589,12 +1593,14 @@ call_decode(struct rpc_task *task)
>  		return;
>  	}
>  
> -	task->tk_action = rpc_exit_task;
> +	task->tk_action = call_complete;
>  
>  	if (decode) {
>  		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
>  						      task->tk_msg.rpc_resp);
>  	}
> +	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
> +	skb_frag_destructor_unref(&req->destructor);
>  	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
>  			task->tk_status);
>  	return;
> @@ -1609,6 +1615,17 @@ out_retry:
>  	}
>  }
>  
> +/*
> + * 8.	Wait for pages to be released by the network stack.
> + */
> +static void
> +call_complete(struct rpc_task *task)
> +{
> +	dprintk("RPC: %5u call_complete result %d\n",
> +		task->tk_pid, task->tk_status);
> +	task->tk_action = rpc_exit_task;
> +}
> +
>  static __be32 *
>  rpc_encode_header(struct rpc_task *task)
>  {
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 852a258..3685cad 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -196,7 +196,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
>  	while (pglen > 0) {
>  		if (slen == size)
>  			flags = 0;
> -		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
> +		result = kernel_sendpage(sock, *ppage, xdr->destructor,
> +					 base, size, flags);
>  		if (result > 0)
>  			len += result;
>  		if (result != size)
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index f4385e4..925aa0c 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1103,6 +1103,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
>  	xprt->xid = net_random();
>  }
>  
> +static int xprt_complete_skb_pages(void *calldata)
> +{
> +	struct rpc_task *task = calldata;
> +	struct rpc_rqst	*req = task->tk_rqstp;
> +
> +	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
> +	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
> +	return 0;
> +}
> +
>  static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
>  {
>  	struct rpc_rqst	*req = task->tk_rqstp;
> @@ -1115,6 +1125,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
>  	req->rq_xid     = xprt_alloc_xid(xprt);
>  	req->rq_release_snd_buf = NULL;
>  	xprt_reset_majortimeo(req);
> +	atomic_set(&req->destructor.ref, 1);
> +	req->destructor.destroy = &xprt_complete_skb_pages;
> +	req->destructor.data = task;
>  	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
>  			req, ntohl(req->rq_xid));
>  }
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index f79e40e9..af3a106 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
>  		remainder -= len;
>  		if (remainder != 0 || more)
>  			flags |= MSG_MORE;
> -		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
> +		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
> +					  base, len, flags);
>  		if (remainder == 0 || err != len)
>  			break;
>  		sent += err;
> -- 
> 1.7.2.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
  2011-11-11 12:38         ` Michael S. Tsirkin
@ 2011-11-11 13:20             ` Ian Campbell
  -1 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-11 13:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, Neil Brown,
	J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA

On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > wire. If this happens to an NFS WRITE RPC then the write() system call
> > completes and the userspace process can continue, potentially modifying data
> > referenced by the retransmission before the retransmission occurs.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> > Acked-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> > Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> > Cc: Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>
> > Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
> > Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> 
> So this blocks the system call until all page references
> are gone, right?

Right. The alternative is to return to userspace while the network stack
still has a reference to the buffer which was passed in -- that's the
exact class of problem this patch is supposed to fix.

> But, there's no upper limit on how long the
> page is referenced, correct?

Correct.

>  consider a bridged setup
> with an skb queued at a tap device - this cause one process
> to block another one by virtue of not consuming a cloned skb?

Hmm, yes.

One approach might be to introduce the concept of an skb timeout to the
stack as a whole and cancel (or deep copy) after that timeout occurs.
That's going to be tricky though I suspect...

A simpler option would be to have end points such as a tap device,
which can swallow skbs for arbitrary lengths of time, implement a policy
in this regard, either to deep copy or to drop after a timeout?
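
i.e. at the point where such a device would queue an skb for an unbounded
time it could do something like the following (completely untested sketch;
skb_has_frag_destructors() is a made-up predicate for "any frag in this
skb carries a destructor"):

	static struct sk_buff *detach_frag_destructors(struct sk_buff *skb)
	{
		struct sk_buff *copy;

		if (!skb_has_frag_destructors(skb))
			return skb;

		/* Deep copy so that the original fragments (and hence
		 * their destructors) are released immediately instead of
		 * being held for as long as the skb sits on the queue. */
		copy = skb_copy(skb, GFP_ATOMIC);
		kfree_skb(skb);
		return copy;	/* NULL on allocation failure == drop */
	}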

Ian.

> 
> > ---
> >  include/linux/sunrpc/xdr.h  |    2 ++
> >  include/linux/sunrpc/xprt.h |    5 ++++-
> >  net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
> >  net/sunrpc/svcsock.c        |    3 ++-
> >  net/sunrpc/xprt.c           |   13 +++++++++++++
> >  net/sunrpc/xprtsock.c       |    3 ++-
> >  6 files changed, 45 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
> > index a20970e..172f81e 100644
> > --- a/include/linux/sunrpc/xdr.h
> > +++ b/include/linux/sunrpc/xdr.h
> > @@ -16,6 +16,7 @@
> >  #include <asm/byteorder.h>
> >  #include <asm/unaligned.h>
> >  #include <linux/scatterlist.h>
> > +#include <linux/skbuff.h>
> >  
> >  /*
> >   * Buffer adjustment
> > @@ -57,6 +58,7 @@ struct xdr_buf {
> >  			tail[1];	/* Appended after page data */
> >  
> >  	struct page **	pages;		/* Array of contiguous pages */
> > +	struct skb_frag_destructor *destructor;
> >  	unsigned int	page_base,	/* Start of page data */
> >  			page_len,	/* Length of page data */
> >  			flags;		/* Flags for data disposition */
> > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > index 15518a1..75131eb 100644
> > --- a/include/linux/sunrpc/xprt.h
> > +++ b/include/linux/sunrpc/xprt.h
> > @@ -92,7 +92,10 @@ struct rpc_rqst {
> >  						/* A cookie used to track the
> >  						   state of the transport
> >  						   connection */
> > -	
> > +	struct skb_frag_destructor destructor;	/* SKB paged fragment
> > +						 * destructor for
> > +						 * transmitted pages*/
> > +
> >  	/*
> >  	 * Partial send handling
> >  	 */
> > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > index c5347d2..919538d 100644
> > --- a/net/sunrpc/clnt.c
> > +++ b/net/sunrpc/clnt.c
> > @@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
> >  static void	call_reserveresult(struct rpc_task *task);
> >  static void	call_allocate(struct rpc_task *task);
> >  static void	call_decode(struct rpc_task *task);
> > +static void	call_complete(struct rpc_task *task);
> >  static void	call_bind(struct rpc_task *task);
> >  static void	call_bind_status(struct rpc_task *task);
> >  static void	call_transmit(struct rpc_task *task);
> > @@ -1113,6 +1114,8 @@ rpc_xdr_encode(struct rpc_task *task)
> >  			 (char *)req->rq_buffer + req->rq_callsize,
> >  			 req->rq_rcvsize);
> >  
> > +	req->rq_snd_buf.destructor = &req->destructor;
> > +
> >  	p = rpc_encode_header(task);
> >  	if (p == NULL) {
> >  		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
> > @@ -1276,6 +1279,7 @@ call_connect_status(struct rpc_task *task)
> >  static void
> >  call_transmit(struct rpc_task *task)
> >  {
> > +	struct rpc_rqst *req = task->tk_rqstp;
> >  	dprint_status(task);
> >  
> >  	task->tk_action = call_status;
> > @@ -1309,8 +1313,8 @@ call_transmit(struct rpc_task *task)
> >  	call_transmit_status(task);
> >  	if (rpc_reply_expected(task))
> >  		return;
> > -	task->tk_action = rpc_exit_task;
> > -	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
> > +	task->tk_action = call_complete;
> > +	skb_frag_destructor_unref(&req->destructor);
> >  }
> >  
> >  /*
> > @@ -1383,7 +1387,8 @@ call_bc_transmit(struct rpc_task *task)
> >  		return;
> >  	}
> >  
> > -	task->tk_action = rpc_exit_task;
> > +	task->tk_action = call_complete;
> > +	skb_frag_destructor_unref(&req->destructor);
> >  	if (task->tk_status < 0) {
> >  		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
> >  			"error: %d\n", task->tk_status);
> > @@ -1423,7 +1428,6 @@ call_bc_transmit(struct rpc_task *task)
> >  			"error: %d\n", task->tk_status);
> >  		break;
> >  	}
> > -	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
> >  }
> >  #endif /* CONFIG_SUNRPC_BACKCHANNEL */
> >  
> > @@ -1589,12 +1593,14 @@ call_decode(struct rpc_task *task)
> >  		return;
> >  	}
> >  
> > -	task->tk_action = rpc_exit_task;
> > +	task->tk_action = call_complete;
> >  
> >  	if (decode) {
> >  		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
> >  						      task->tk_msg.rpc_resp);
> >  	}
> > +	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
> > +	skb_frag_destructor_unref(&req->destructor);
> >  	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
> >  			task->tk_status);
> >  	return;
> > @@ -1609,6 +1615,17 @@ out_retry:
> >  	}
> >  }
> >  
> > +/*
> > + * 8.	Wait for pages to be released by the network stack.
> > + */
> > +static void
> > +call_complete(struct rpc_task *task)
> > +{
> > +	dprintk("RPC: %5u call_complete result %d\n",
> > +		task->tk_pid, task->tk_status);
> > +	task->tk_action = rpc_exit_task;
> > +}
> > +
> >  static __be32 *
> >  rpc_encode_header(struct rpc_task *task)
> >  {
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 852a258..3685cad 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -196,7 +196,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
> >  	while (pglen > 0) {
> >  		if (slen == size)
> >  			flags = 0;
> > -		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
> > +		result = kernel_sendpage(sock, *ppage, xdr->destructor,
> > +					 base, size, flags);
> >  		if (result > 0)
> >  			len += result;
> >  		if (result != size)
> > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > index f4385e4..925aa0c 100644
> > --- a/net/sunrpc/xprt.c
> > +++ b/net/sunrpc/xprt.c
> > @@ -1103,6 +1103,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
> >  	xprt->xid = net_random();
> >  }
> >  
> > +static int xprt_complete_skb_pages(void *calldata)
> > +{
> > +	struct rpc_task *task = calldata;
> > +	struct rpc_rqst	*req = task->tk_rqstp;
> > +
> > +	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
> > +	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
> > +	return 0;
> > +}
> > +
> >  static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> >  {
> >  	struct rpc_rqst	*req = task->tk_rqstp;
> > @@ -1115,6 +1125,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> >  	req->rq_xid     = xprt_alloc_xid(xprt);
> >  	req->rq_release_snd_buf = NULL;
> >  	xprt_reset_majortimeo(req);
> > +	atomic_set(&req->destructor.ref, 1);
> > +	req->destructor.destroy = &xprt_complete_skb_pages;
> > +	req->destructor.data = task;
> >  	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
> >  			req, ntohl(req->rq_xid));
> >  }
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index f79e40e9..af3a106 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
> >  		remainder -= len;
> >  		if (remainder != 0 || more)
> >  			flags |= MSG_MORE;
> > -		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
> > +		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
> > +					  base, len, flags);
> >  		if (remainder == 0 || err != len)
> >  			break;
> >  		sent += err;
> > -- 
> > 1.7.2.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
@ 2011-11-11 13:20             ` Ian Campbell
  0 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-11 13:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs

On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > wire. If this happens to an NFS WRITE RPC then the write() system call
> > completes and the userspace process can continue, potentially modifying data
> > referenced by the retransmission before the retransmission occurs.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Neil Brown <neilb@suse.de>
> > Cc: "J. Bruce Fields" <bfields@fieldses.org>
> > Cc: linux-nfs@vger.kernel.org
> > Cc: netdev@vger.kernel.org
> 
> So this blocks the system call until all page references
> are gone, right?

Right. The alternative is to return to userspace while the network stack
still has a reference to the buffer which was passed in -- that's the
exact class of problem this patch is supposed to fix.

> But, there's no upper limit on how long the
> page is referenced, correct?

Correct.

>  consider a bridged setup
> with an skb queued at a tap device - this cause one process
> to block another one by virtue of not consuming a cloned skb?

Hmm, yes.

One approach might be to introduce the concept of an skb timeout to the
stack as a whole and cancel (or deep copy) after that timeout occurs.
That's going to be tricky though I suspect...

A simpler option would be to have end points such as a tap device,
which can swallow skbs for arbitrary lengths of time, implement a
policy in this regard: either deep copy or drop after a timeout?
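
To illustrate the sort of policy I mean (purely a sketch, nothing I
have implemented; skb_has_frag_destructors() is a made-up stand-in for
"some fragment still carries a destructor"):

static int tap_enqueue(struct sk_buff_head *queue, struct sk_buff *skb)
{
	if (skb_has_frag_destructors(skb)) {
		/* Deep copy so the injector's pages are released now
		 * rather than whenever the reader finally consumes the
		 * skb; dropping after a timeout would be the other
		 * option. */
		struct sk_buff *copy = skb_copy(skb, GFP_ATOMIC);

		if (!copy)
			return -ENOMEM;
		kfree_skb(skb);
		skb = copy;
	}
	skb_queue_tail(queue, skb);
	return 0;
}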

Ian.

> 
> > ---
> >  include/linux/sunrpc/xdr.h  |    2 ++
> >  include/linux/sunrpc/xprt.h |    5 ++++-
> >  net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
> >  net/sunrpc/svcsock.c        |    3 ++-
> >  net/sunrpc/xprt.c           |   13 +++++++++++++
> >  net/sunrpc/xprtsock.c       |    3 ++-
> >  6 files changed, 45 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
> > index a20970e..172f81e 100644
> > --- a/include/linux/sunrpc/xdr.h
> > +++ b/include/linux/sunrpc/xdr.h
> > @@ -16,6 +16,7 @@
> >  #include <asm/byteorder.h>
> >  #include <asm/unaligned.h>
> >  #include <linux/scatterlist.h>
> > +#include <linux/skbuff.h>
> >  
> >  /*
> >   * Buffer adjustment
> > @@ -57,6 +58,7 @@ struct xdr_buf {
> >  			tail[1];	/* Appended after page data */
> >  
> >  	struct page **	pages;		/* Array of contiguous pages */
> > +	struct skb_frag_destructor *destructor;
> >  	unsigned int	page_base,	/* Start of page data */
> >  			page_len,	/* Length of page data */
> >  			flags;		/* Flags for data disposition */
> > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > index 15518a1..75131eb 100644
> > --- a/include/linux/sunrpc/xprt.h
> > +++ b/include/linux/sunrpc/xprt.h
> > @@ -92,7 +92,10 @@ struct rpc_rqst {
> >  						/* A cookie used to track the
> >  						   state of the transport
> >  						   connection */
> > -	
> > +	struct skb_frag_destructor destructor;	/* SKB paged fragment
> > +						 * destructor for
> > +						 * transmitted pages*/
> > +
> >  	/*
> >  	 * Partial send handling
> >  	 */
> > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > index c5347d2..919538d 100644
> > --- a/net/sunrpc/clnt.c
> > +++ b/net/sunrpc/clnt.c
> > @@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
> >  static void	call_reserveresult(struct rpc_task *task);
> >  static void	call_allocate(struct rpc_task *task);
> >  static void	call_decode(struct rpc_task *task);
> > +static void	call_complete(struct rpc_task *task);
> >  static void	call_bind(struct rpc_task *task);
> >  static void	call_bind_status(struct rpc_task *task);
> >  static void	call_transmit(struct rpc_task *task);
> > @@ -1113,6 +1114,8 @@ rpc_xdr_encode(struct rpc_task *task)
> >  			 (char *)req->rq_buffer + req->rq_callsize,
> >  			 req->rq_rcvsize);
> >  
> > +	req->rq_snd_buf.destructor = &req->destructor;
> > +
> >  	p = rpc_encode_header(task);
> >  	if (p == NULL) {
> >  		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
> > @@ -1276,6 +1279,7 @@ call_connect_status(struct rpc_task *task)
> >  static void
> >  call_transmit(struct rpc_task *task)
> >  {
> > +	struct rpc_rqst *req = task->tk_rqstp;
> >  	dprint_status(task);
> >  
> >  	task->tk_action = call_status;
> > @@ -1309,8 +1313,8 @@ call_transmit(struct rpc_task *task)
> >  	call_transmit_status(task);
> >  	if (rpc_reply_expected(task))
> >  		return;
> > -	task->tk_action = rpc_exit_task;
> > -	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
> > +	task->tk_action = call_complete;
> > +	skb_frag_destructor_unref(&req->destructor);
> >  }
> >  
> >  /*
> > @@ -1383,7 +1387,8 @@ call_bc_transmit(struct rpc_task *task)
> >  		return;
> >  	}
> >  
> > -	task->tk_action = rpc_exit_task;
> > +	task->tk_action = call_complete;
> > +	skb_frag_destructor_unref(&req->destructor);
> >  	if (task->tk_status < 0) {
> >  		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
> >  			"error: %d\n", task->tk_status);
> > @@ -1423,7 +1428,6 @@ call_bc_transmit(struct rpc_task *task)
> >  			"error: %d\n", task->tk_status);
> >  		break;
> >  	}
> > -	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
> >  }
> >  #endif /* CONFIG_SUNRPC_BACKCHANNEL */
> >  
> > @@ -1589,12 +1593,14 @@ call_decode(struct rpc_task *task)
> >  		return;
> >  	}
> >  
> > -	task->tk_action = rpc_exit_task;
> > +	task->tk_action = call_complete;
> >  
> >  	if (decode) {
> >  		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
> >  						      task->tk_msg.rpc_resp);
> >  	}
> > +	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
> > +	skb_frag_destructor_unref(&req->destructor);
> >  	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
> >  			task->tk_status);
> >  	return;
> > @@ -1609,6 +1615,17 @@ out_retry:
> >  	}
> >  }
> >  
> > +/*
> > + * 8.	Wait for pages to be released by the network stack.
> > + */
> > +static void
> > +call_complete(struct rpc_task *task)
> > +{
> > +	dprintk("RPC: %5u call_complete result %d\n",
> > +		task->tk_pid, task->tk_status);
> > +	task->tk_action = rpc_exit_task;
> > +}
> > +
> >  static __be32 *
> >  rpc_encode_header(struct rpc_task *task)
> >  {
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 852a258..3685cad 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -196,7 +196,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
> >  	while (pglen > 0) {
> >  		if (slen == size)
> >  			flags = 0;
> > -		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
> > +		result = kernel_sendpage(sock, *ppage, xdr->destructor,
> > +					 base, size, flags);
> >  		if (result > 0)
> >  			len += result;
> >  		if (result != size)
> > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > index f4385e4..925aa0c 100644
> > --- a/net/sunrpc/xprt.c
> > +++ b/net/sunrpc/xprt.c
> > @@ -1103,6 +1103,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
> >  	xprt->xid = net_random();
> >  }
> >  
> > +static int xprt_complete_skb_pages(void *calldata)
> > +{
> > +	struct rpc_task *task = calldata;
> > +	struct rpc_rqst	*req = task->tk_rqstp;
> > +
> > +	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
> > +	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
> > +	return 0;
> > +}
> > +
> >  static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> >  {
> >  	struct rpc_rqst	*req = task->tk_rqstp;
> > @@ -1115,6 +1125,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> >  	req->rq_xid     = xprt_alloc_xid(xprt);
> >  	req->rq_release_snd_buf = NULL;
> >  	xprt_reset_majortimeo(req);
> > +	atomic_set(&req->destructor.ref, 1);
> > +	req->destructor.destroy = &xprt_complete_skb_pages;
> > +	req->destructor.data = task;
> >  	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
> >  			req, ntohl(req->rq_xid));
> >  }
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index f79e40e9..af3a106 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
> >  		remainder -= len;
> >  		if (remainder != 0 || more)
> >  			flags |= MSG_MORE;
> > -		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
> > +		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
> > +					  base, len, flags);
> >  		if (remainder == 0 || err != len)
> >  			break;
> >  		sent += err;
> > -- 
> > 1.7.2.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
@ 2011-11-11 20:00                 ` J. Bruce Fields
  0 siblings, 0 replies; 52+ messages in thread
From: J. Bruce Fields @ 2011-11-11 20:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Michael S. Tsirkin, netdev, David S. Miller, Neil Brown, linux-nfs

On Fri, Nov 11, 2011 at 01:20:27PM +0000, Ian Campbell wrote:
> On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> > On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > completes and the userspace process can continue, potentially modifying data
> > > referenced by the retransmission before the retransmission occurs.
> > > 
> > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > Cc: "David S. Miller" <davem@davemloft.net>
> > > Cc: Neil Brown <neilb@suse.de>
> > > Cc: "J. Bruce Fields" <bfields@fieldses.org>
> > > Cc: linux-nfs@vger.kernel.org
> > > Cc: netdev@vger.kernel.org
> > 
> > So this blocks the system call until all page references
> > are gone, right?
> 
> Right. The alternative is to return to userspace while the network stack
> still has a reference to the buffer which was passed in -- that's the
> exact class of problem this patch is supposed to fix.
> 
> > But, there's no upper limit on how long the
> > page is referenced, correct?
> 
> Correct.
> 
> >  consider a bridged setup
> > with an skb queued at a tap device - this cause one process
> > to block another one by virtue of not consuming a cloned skb?
> 
> Hmm, yes.
> 
> One approach might be to introduce the concept of an skb timeout to the
> stack as a whole and cancel (or deep copy) after that timeout occurs.
> That's going to be tricky though I suspect...
> 
> A simpler option would be to have an end points such as a tap device
> which can swallow skbs for arbitrary times implement a policy in this
> regard, either to deep copy or drop after a timeout?

Stupid question: Is it a requirement that you be safe against DOS by a
rogue process with a tap device?  (And if so, does current code satisfy
that requirement?)

--b.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
  2011-11-11 13:20             ` Ian Campbell
  (?)
  (?)
@ 2011-11-13 10:17             ` Michael S. Tsirkin
       [not found]               ` <20111113101713.GB15322-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  -1 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2011-11-13 10:17 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs

On Fri, Nov 11, 2011 at 01:20:27PM +0000, Ian Campbell wrote:
> On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> > On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > completes and the userspace process can continue, potentially modifying data
> > > referenced by the retransmission before the retransmission occurs.
> > > 
> > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > Cc: "David S. Miller" <davem@davemloft.net>
> > > Cc: Neil Brown <neilb@suse.de>
> > > Cc: "J. Bruce Fields" <bfields@fieldses.org>
> > > Cc: linux-nfs@vger.kernel.org
> > > Cc: netdev@vger.kernel.org
> > 
> > So this blocks the system call until all page references
> > are gone, right?
> 
> Right. The alternative is to return to userspace while the network stack
> still has a reference to the buffer which was passed in -- that's the
> exact class of problem this patch is supposed to fix.

BTW, for some workloads the increased latency and the overhead of the
extra wakeups might be greater than the cost of the data copy.

> > But, there's no upper limit on how long the
> > page is referenced, correct?
> 
> Correct.
> 
> >  consider a bridged setup
> > with an skb queued at a tap device - this cause one process
> > to block another one by virtue of not consuming a cloned skb?
> 
> Hmm, yes.
> 
> One approach might be to introduce the concept of an skb timeout to the
> stack as a whole and cancel (or deep copy) after that timeout occurs.
> That's going to be tricky though I suspect...

Further, an application might use signals such as SIGALARM;
delaying them significantly will cause trouble.

> A simpler option would be to have an end points such as a tap device

Which end points would that be? Doesn't this affect a packet socket
with matching filters? A userspace TCP socket that happens to
reside on the same box?  It also seems that at least with a tap device
an skb can get queued in a qdisc, also indefinitely, right?

> which can swallow skbs for arbitrary times implement a policy in this
> regard, either to deep copy or drop after a timeout?
> 
> Ian.

Or deep copy immediately?

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
@ 2011-11-14 13:07                   ` Ian Campbell
  0 siblings, 0 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-14 13:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs

On Sun, 2011-11-13 at 10:17 +0000, Michael S. Tsirkin wrote:
> On Fri, Nov 11, 2011 at 01:20:27PM +0000, Ian Campbell wrote:
> > On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> > > On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > > completes and the userspace process can continue, potentially modifying data
> > > > referenced by the retransmission before the retransmission occurs.
> > > > 
> > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > > Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > > Cc: "David S. Miller" <davem@davemloft.net>
> > > > Cc: Neil Brown <neilb@suse.de>
> > > > Cc: "J. Bruce Fields" <bfields@fieldses.org>
> > > > Cc: linux-nfs@vger.kernel.org
> > > > Cc: netdev@vger.kernel.org
> > > 
> > > So this blocks the system call until all page references
> > > are gone, right?
> > 
> > Right. The alternative is to return to userspace while the network stack
> > still has a reference to the buffer which was passed in -- that's the
> > exact class of problem this patch is supposed to fix.
> 
> BTW, the increased latency and the overhead extra wakeups might for some
> workloads be greater than the cost of the data copy.

Under normal circumstances these paths should not be activated at all.
They only come into play if there are delays coming from somewhere in
the network, and I would expect any negative effect of those delays to
outweigh the cost of either a copy or an additional wakeup.

> > 
> > >  consider a bridged setup
> > > with an skb queued at a tap device - this cause one process
> > > to block another one by virtue of not consuming a cloned skb?
> > 
> > Hmm, yes.
> > 
> > One approach might be to introduce the concept of an skb timeout to the
> > stack as a whole and cancel (or deep copy) after that timeout occurs.
> > That's going to be tricky though I suspect...
> 
> Further, an application might use signals such as SIGALARM,
> delaying them significantly will cause trouble.

AIUI there is nothing to stop the SIGALARM being delivered in a timely
manner; all that may need to be delayed is the write() returning
-EINTR.

When -EINTR is returned the buffer must no longer be referenced, either
implicitly or explicitly, by the kernel, and the write must not have
completed, nor should it complete in the future (that way lies
corruption of varying sorts), so copying the data pages is not helpful
in this case.

This patch ensures that the buffer is no longer referenced when the
write returns. It's possible that NFS might need to cancel a write
operation in order to not complete it after returning -EINTR (I don't
know if it does this or not) but I don't think this series impacts that
one way or the other.

> 
> > A simpler option would be to have an end points such as a tap device
> 
> Which end points would that be? Doesn't this affect a packet socket
> with matching filters? A userspace TCP socket that happens to
> reside on the same box?  It also seems that at least with a tap device
> an skb can get queued in a qdisk, also indefinitely, right?

Those are all possibilities.

In order for this patch to have any impact on any of these scenarios
those end points would have to currently be referencing and using pages
of data which have been "returned" to the originating user process and
may be changing under their feet. This is without a doubt a Bad Thing.

In the normal case it is likely that the same end point which is
injecting the delay is also the one the originating process is actually
trying to talk to, so the delay would already have been present and
this patch doesn't delay things any further.

If there are parts of the stack which can end up holding onto an skb for
an arbitrary amount of time then I think that is something which needs
to be fixed up in those end points rather than in everyone who injects
an skb into the stack. Whether an immediate deep copy or a more lazy
approach is appropriate I suspect depends upon the exact use case of
each end point.

Having said that, one idea which springs to mind would be to allow
someone who has injected a page into the stack to "cancel" it. Since I
am working on pushing the struct page * down into the struct
skb_destructor this could be as simple as setting the page to NULL.
However, every end point would need to be taught to expect this. I'm
sure there are also all sorts of locking nightmares underlying this
idea. Perhaps you'd need separate reference counts for pages that are
queued vs. actively in use.
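
A very rough sketch of the cancel idea, assuming the proposed struct
skb_frag_destructor really does grow a struct page pointer as described
(the "page" member and both helpers below are hypothetical), and
ignoring the locking/ordering problems just mentioned:

static inline void skb_frag_destructor_cancel(struct skb_frag_destructor *d)
{
	d->page = NULL;		/* originator revokes the page */
}

/* ... and every end point would need a check along these lines: */
static inline struct page *skb_frag_destructor_page(struct skb_frag_destructor *d)
{
	return d->page;		/* NULL means the data was cancelled, drop it */
}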

Ian.

> 
> > which can swallow skbs for arbitrary times implement a policy in this
> > regard, either to deep copy or drop after a timeout?
> > 
> > Ian.
> 
> Or deep copy immediately?
> 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
  2011-11-14 13:07                   ` Ian Campbell
  (?)
@ 2011-11-14 13:25                   ` David Laight
  -1 siblings, 0 replies; 52+ messages in thread
From: David Laight @ 2011-11-14 13:25 UTC (permalink / raw)
  To: Ian Campbell, Michael S. Tsirkin
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs

 
> This prevents an issue where an ACK is delayed, a retransmit is queued
> (either at the RPC or TCP level) and the ACK arrives before the
> retransmission hits the wire. If this happens to an NFS WRITE RPC then
> the write() system call completes and the userspace process can
> continue, potentially modifying data referenced by the retransmission
> before the retransmission occurs.

The only problem I see is that the source address might get
invalidated (assuming it is a user address mapped into kernel).
Not sure what effect the fault would have...

If it is an RPC retransmission the NFS write won't be
terminated by the RPC ACK (because it is a stale ACK)
and the destination will see two copies of the NFS write.
(This is quite common with NFS/UDP.)

If it is a TCP retransmission, then the receiving TCP stack
will see it as duplicate data and discard the contents
without passing them to RPC.

Interrupting an NFS write with a signal is problematic
anyway - I'm sure I've seen systems (SYSV?) where NFS
writes were uninterruptible. NFI how Linux handles this.

	David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-11-09 17:49 ` [PATCH 0/4] skb paged fragment destructors Eric Dumazet
  2011-11-10 10:39   ` Ian Campbell
@ 2011-11-17 14:45   ` Ian Campbell
  2011-11-17 14:51     ` Eric Dumazet
  2011-11-17 20:22     ` Michał Mirosław
  1 sibling, 2 replies; 52+ messages in thread
From: Ian Campbell @ 2011-11-17 14:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jesse Brandeburg, netdev

On Wed, 2011-11-09 at 17:49 +0000, Eric Dumazet wrote:
> On Wed, 2011-11-09 at 15:01 +0000, Ian Campbell wrote:
> > The following series makes use of the skb fragment API (which is in 3.2)
> > to add a per-paged-fragment destructor callback. This can be used by
> > creators of skbs who are interested in the lifecycle of the pages
> > included in that skb after they have handed it off to the network stack.
> > I think these have all been posted before, but have been backed up
> > behind the skb fragment API.
> > 
> > The mail at [0] contains some more background and rationale but
> > basically the completed series will allow entities which inject pages
> > into the networking stack to receive a notification when the stack has
> > really finished with those pages (i.e. including retransmissions,
> > clones, pull-ups etc) and not just when the original skb is finished
> > with, which is beneficial to many subsystems which wish to inject pages
> > into the network stack without giving up full ownership of those page's
> > lifecycle. It implements something broadly along the lines of what was
> > described in [1].
> > 
> > I have also included a patch to the RPC subsystem which uses this API to
> > fix the bug which I describe at [2].
> > 
> > I presented this work at LPC in September and there was a
> > question/concern raised (by Jesse Brandenburg IIRC) regarding the
> > overhead of adding this extra field per fragment. If I understand
> > correctly it seems that in the there have been performance regressions
> > in the past with allocations outgrowing one allocation size bucket and
> > therefore using the next. The change in datastructure size resulting
> > from this series is:
> > 					  BEFORE	AFTER
> > AMD64:	sizeof(struct skb_frag_struct)	= 16		24
> > 	sizeof(struct skb_shared_info)	= 344		488
> 
> Thats a real problem, because 488 is soo big. (its even rounded to 512
> bytes)
> 
> Now, on x86, a half page (2048 bytes) wont be big enough to contain a
> typical frame (MTU=1500)
> 
> NET_SKB_PAD (64) + 1500 + 14 + 512 > 2048

Am I right that it's not so much a case of fitting into half a page but
more like not wasting nearly half a page if you end up requiring a 4096
byte allocation? (This is probably splitting hairs, I'm really just
curious)
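
(To put rough numbers on the waste, using the AFTER sizes quoted above
and assuming the usual power-of-two kmalloc buckets:

NET_SKB_PAD (64) + 1500 + 14 + 488 = 2066 > 2048

so the head allocation gets rounded up to 4096 bytes and nearly half of
every receive buffer goes unused, i.e. one frame per page rather than
two.)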

> Even if we dont round 488 to 512, (no cache align skb_shared_info) we
> have a problem.
> 
> NET_SKB_PAD (64) + 1500 + 14 + 488 > 2048
> 
> Why not using a low order bit to mark 'page' being a pointer to 

I've given this a go but it makes the patches to the users of this
facility quite nasty (or at least it does for the sunrpc/NFS case). I'm
going to plug away at it a bit more and see if I can't make it look
cleaner, but I was starting to wonder about alternative approaches.

One idea was to split the allocation of the data and the shinfo. Since
shinfo is a fixed size it would be easy to have a specific cache for
them. Does this sound even vaguely plausible?
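
Concretely I am thinking of something along these lines (sketch only):

	skbuff_shinfo_cache = kmem_cache_create("skbuff_shinfo_cache",
						sizeof(struct skb_shared_info),
						0,
						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
						NULL);

with __alloc_skb() and build_skb() taking the shinfo from that cache
and allocating the linear data on its own.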

> struct skb_frag_page_desc {
> 	struct page *p;
> 	atomic_t ref;
> 	int (*destroy)(void *data);
> /*	void *data; */ /* no need, see container_of() */

It turns out that container_of is not so useful here as the users
typically have a list of pages, not a single page, and hence have a
list of destructors too. What you actually need is the container of the
pointer to that list, IYSWIM, which you can't get at given only a
pointer to an element of the list. So you end up doing

	struct subsys_page_desc {
		struct subsys_container *container;
		struct skb_frag_page_desc desc;
	};

*container here is basically the same as the void * so you might as well
include it in the base datastructure.

Ian.

> };

> 
> struct skb_frag_struct {
>         struct {
>                 union {
> 			struct page *p; /* low order bit not set */
> 			struct skb_frag_page_desc *skbpage; /* low order bit set */
> 		};
>         } page;
> ...
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-11-17 14:45   ` Ian Campbell
@ 2011-11-17 14:51     ` Eric Dumazet
  2011-11-17 20:22     ` Michał Mirosław
  1 sibling, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2011-11-17 14:51 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, Jesse Brandeburg, netdev

On Thu, 2011-11-17 at 14:45 +0000, Ian Campbell wrote:

> Am I right that it's not so much a case of fitting into half a page but
> more like not wasting nearly half a page if you end up requiring a 4096
> byte allocation? (This is probably splitting hairs, I'm really just
> curious)

Performance implications are quite high :

Here is what I did to the bnx2x driver to reduce the space from 4096 to
2048, and you can see the performance results :

http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=e52fcb2462ac484e6dd6e68869536609f0216938

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-11-17 14:45   ` Ian Campbell
  2011-11-17 14:51     ` Eric Dumazet
@ 2011-11-17 20:22     ` Michał Mirosław
  1 sibling, 0 replies; 52+ messages in thread
From: Michał Mirosław @ 2011-11-17 20:22 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Eric Dumazet, David Miller, Jesse Brandeburg, netdev

2011/11/17 Ian Campbell <Ian.Campbell@citrix.com>:
> On Wed, 2011-11-09 at 17:49 +0000, Eric Dumazet wrote:
>> struct skb_frag_page_desc {
>>       struct page *p;
>>       atomic_t ref;
>>       int (*destroy)(void *data);
>> /*    void *data; */ /* no need, see container_of() */
>
> It turns out that container_of is not so useful here as the users
> typically has a list of pages not a single page and hence has a list of
> destructors too. What you actually need is the container of the pointer
> to that list, IYSWIM, which you can't get at given only a pointer to an
> element of the list. So you end up doing
>
>        struct subsys_page_desc {
>                struct subsys_container *container;
>                struct sbk_frag_page_desc;
>        }
>
> *container here is basically the same as the void * so you might as well
> include it in the base datastructure.

If you reverse the order of the fields in subsys_page_desc then
container_of() compiles to a no-op, and you don't have to impose this
extra field on all users, even if they don't need it.

If you see it as useful, then there could be a base skb_frag_page_desc
and then a skb_frag_page_desc_with_data (or other) extending it. Core
code only ever needs the base structure.

BTW, destroy() prototype should be:

void destroy(struct skb_frag_page_desc *desc);

Because: 1. you know the parameter's type, 2. whatever would be
returned is not going to be useful.
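
To spell that out (all names below are illustrative): with the generic
descriptor placed first, container_of() is effectively just a cast:

struct skb_frag_page_desc {
	struct page *p;
	atomic_t ref;
	void (*destroy)(struct skb_frag_page_desc *desc);
};

struct subsys_page_desc {
	struct skb_frag_page_desc frag;	/* keep this first */
	struct subsys_container *container;
};

static void subsys_destroy(struct skb_frag_page_desc *desc)
{
	struct subsys_page_desc *d =
		container_of(desc, struct subsys_page_desc, frag);

	/* d->container is reachable without the generic structure
	 * having to carry a void *data pointer */
	subsys_page_done(d->container, desc->p);
}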

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-11-09 15:01 [PATCH 0/4] skb paged fragment destructors Ian Campbell
                   ` (3 preceding siblings ...)
  2011-11-09 17:49 ` [PATCH 0/4] skb paged fragment destructors Eric Dumazet
@ 2011-12-06 11:57 ` Ian Campbell
  2011-12-06 13:24   ` Eric Dumazet
  4 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-06 11:57 UTC (permalink / raw)
  To: David Miller; +Cc: Jesse Brandeburg, netdev

On Wed, 2011-11-09 at 15:01 +0000, Ian Campbell wrote:
>       * split linear data allocation and shinfo allocation into two. I
>         suspect this will have its own performance implications? On the
>         positive side skb_shared_info could come from its own fixed size
>         pool/cache which might have some benefits

I played with this to see how it would look. Illustrative patch below. 

I figure that lots of small frames is the interesting workload for a
change such as this but I don't know if iperf is necessarily the best
benchmark for measuring that.
Before changing things I got:
        iperf -c qarun -m -t 60 -u -b 10000M -l 64
        ------------------------------------------------------------
        Client connecting to qarun, UDP port 5001
        Sending 64 byte datagrams
        UDP buffer size:   224 KByte (default)
        ------------------------------------------------------------
        [  3] local 10.80.225.63 port 45857 connected with 10.80.224.22 port 5001
        [ ID] Interval       Transfer     Bandwidth
        [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec
        [  3] Sent 13820376 datagrams
        [  3] Server Report:
        [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec  0.005 ms    0/13820375 (0%)
        [  3]  0.0-60.0 sec  1 datagrams received out-of-order
whereas with the patch:
        # iperf -c qarun -m -t 60 -u -b 10000M -l 64
        ------------------------------------------------------------
        Client connecting to qarun, UDP port 5001
        Sending 64 byte datagrams
        UDP buffer size:   224 KByte (default)
        ------------------------------------------------------------
        [  3] local 10.80.225.63 port 42504 connected with 10.80.224.22 port 5001
        [ ID] Interval       Transfer     Bandwidth
        [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec
        [  3] Sent 13645857 datagrams
        [  3] Server Report:
        [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec  0.005 ms    0/13645856 (0%)
        [  3]  0.0-60.0 sec  1 datagrams received out-of-order

With 1200 byte datagrams I get basically identical throughput.

(nb: none of the skb destructor stuff was present in either case)

>       * steal a bit a pointer to indicate destructor pointer vs regular
>         struct page pointer (moving the struct page into the destructor
>         datastructure for that case). Stops us sharing a single
>         destructor between multiple pages, but that isn't critical

I also implemented this and, compared to the above, the patch is pretty
fugly, especially in its impact on the SUNRPC patches which actually
use the feature. It needs a bit more cleanup before I can post it for
comparison though.
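
For reference, that is the approach Eric sketched above: overload the
fragment's page pointer so that, with the low bit set, it points at a
descriptor rather than directly at a struct page. The accessors end up
looking roughly like this (illustrative only, not the cleaned-up patch;
skb_frag_page_desc is as in Eric's sketch):

static inline bool skb_frag_has_desc(const skb_frag_t *frag)
{
	return (unsigned long)frag->page.p & 1UL;
}

static inline struct skb_frag_page_desc *skb_frag_desc(const skb_frag_t *frag)
{
	return (struct skb_frag_page_desc *)((unsigned long)frag->page.p & ~1UL);
}

static inline struct page *skb_frag_page(const skb_frag_t *frag)
{
	return skb_frag_has_desc(frag) ? skb_frag_desc(frag)->p : frag->page.p;
}

It is the need to thread this through every fragment accessor and every
user (the SUNRPC code in particular) which makes it so fugly.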

Ian.

 drivers/net/ethernet/broadcom/bnx2.c        |    5 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |    4 +-
 drivers/net/ethernet/broadcom/tg3.c         |    3 +-
 include/linux/skbuff.h                      |    9 ++-
 net/core/skbuff.c                           |   67 ++++++++++++++++----------
 5 files changed, 51 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 83d8cef..4e37e05 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -5320,8 +5320,7 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* 8 for CRC and VLAN */
 	rx_size = bp->dev->mtu + ETH_HLEN + BNX2_RX_OFFSET + 8;
 
-	rx_space = SKB_DATA_ALIGN(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD +
-		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	rx_space = SKB_DATA_ALIGN(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD;
 
 	bp->rx_copy_thresh = BNX2_RX_COPY_THRESH;
 	bp->rx_pg_ring_size = 0;
@@ -5345,7 +5344,7 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	bp->rx_buf_use_size = rx_size;
 	/* hw alignment + build_skb() overhead*/
 	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
-		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		NET_SKB_PAD;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, MAX_RX_RINGS);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 0f7b7a4..b3de327 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1203,9 +1203,7 @@ struct bnx2x {
 	 */
 #define BNX2X_FW_RX_ALIGN_START	(1UL << BNX2X_RX_ALIGN_SHIFT)
 
-#define BNX2X_FW_RX_ALIGN_END					\
-	max(1UL << BNX2X_RX_ALIGN_SHIFT, 			\
-	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+#define BNX2X_FW_RX_ALIGN_END	(1UL << BNX2X_RX_ALIGN_SHIFT)
 
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
 
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index d9e9c8c..dbea57b 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -5426,8 +5426,7 @@ static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 	 * Callers depend upon this behavior and assume that
 	 * we leave everything unchanged if we fail.
 	 */
-	skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) +
-		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp));
 	data = kmalloc(skb_size, GFP_ATOMIC);
 	if (!data)
 		return -ENOMEM;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 09b7ea5..48e91a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,8 +40,7 @@
 
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
-#define SKB_WITH_OVERHEAD(X)	\
-	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+#define SKB_WITH_OVERHEAD(X)	((X))
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
@@ -476,6 +475,7 @@ struct sk_buff {
 	sk_buff_data_t		end;
 	unsigned char		*head,
 				*data;
+	struct skb_shared_info	*shinfo;
 	unsigned int		truesize;
 	atomic_t		users;
 };
@@ -634,7 +634,10 @@ static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
 #endif
 
 /* Internal */
-#define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
+static inline struct skb_shared_info *skb_shinfo(const struct sk_buff *skb)
+{
+	return skb->shinfo;
+}
 
 static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7dc05ec..cfbe8e5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,6 +73,7 @@
 
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
+static struct kmem_cache *skbuff_shinfo_cache __read_mostly;
 
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
@@ -118,7 +119,7 @@ static const struct pipe_buf_operations sock_pipe_buf_ops = {
  *
  *	Out of line support code for skb_put(). Not user callable.
  */
-static void skb_over_panic(struct sk_buff *skb, int sz, void *here)
+static void skb_over_panic(struct sk_buff*skb, int sz, void *here)
 {
 	printk(KERN_EMERG "skb_over_panic: text:%p len:%d put:%d head:%p "
 			  "data:%p tail:%#lx end:%#lx dev:%s\n",
@@ -184,22 +185,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 	prefetchw(skb);
 
+	shinfo = kmem_cache_alloc_node(skbuff_shinfo_cache,
+				       gfp_mask & ~__GFP_DMA, node);
+	if (!shinfo)
+		goto noshinfo;
+	prefetchw(shinfo);
+
 	/* We do our best to align skb_shared_info on a separate cache
 	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
 	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
 	 * Both skb->head and skb_shared_info are cache line aligned.
 	 */
 	size = SKB_DATA_ALIGN(size);
-	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	data = kmalloc_node_track_caller(size, gfp_mask, node);
 	if (!data)
 		goto nodata;
-	/* kmalloc(size) might give us more room than requested.
-	 * Put skb_shared_info exactly at the end of allocated zone,
-	 * to allow max possible filling before reallocation.
-	 */
-	size = SKB_WITH_OVERHEAD(ksize(data));
-	prefetchw(data + size);
 
 	/*
 	 * Only clear those fields we need to clear, not those that we will
@@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	/* Account for allocated memory : skb + skb->head */
 	skb->truesize = SKB_TRUESIZE(size);
 	atomic_set(&skb->users, 1);
+	skb->shinfo = shinfo;
 	skb->head = data;
 	skb->data = data;
 	skb_reset_tail_pointer(skb);
@@ -219,7 +220,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 #endif
 
 	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
+	BUG_ON(shinfo != skb_shinfo(skb));
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
@@ -238,6 +239,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 out:
 	return skb;
 nodata:
+	kmem_cache_free(skbuff_shinfo_cache, shinfo);
+noshinfo:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
 	goto out;
@@ -254,8 +257,8 @@ EXPORT_SYMBOL(__alloc_skb);
  * On a failure the return is %NULL, and @data is not freed.
  * Notes :
  *  Before IO, driver allocates only data buffer where NIC put incoming frame
- *  Driver should add room at head (NET_SKB_PAD) and
- *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
+ *  Driver should add room at head (NET_SKB_PAD) and need not
+ *  add room at tail (SKB_DATA_ALIGN(skb_shared_info))
  *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
  *  before giving packet to stack.
  *  RX rings only contains data buffers, not full skbs.
@@ -270,11 +273,17 @@ struct sk_buff *build_skb(void *data)
 	if (!skb)
 		return NULL;
 
-	size = ksize(data) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	shinfo = kmem_cache_alloc(skbuff_shinfo_cache, GFP_ATOMIC);
+	if (!shinfo)
+		goto noshinfo;
+	prefetchw(shinfo);
+
+	size = ksize(data);
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = SKB_TRUESIZE(size);
 	atomic_set(&skb->users, 1);
+	skb->shinfo = shinfo;
 	skb->head = data;
 	skb->data = data;
 	skb_reset_tail_pointer(skb);
@@ -284,12 +293,17 @@ struct sk_buff *build_skb(void *data)
 #endif
 
 	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
+	BUG_ON(shinfo != skb_shinfo(skb));
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 
+out:
 	return skb;
+noshinfo:
+	kmem_cache_free(skbuff_head_cache, skb);
+	skb = NULL;
+	goto out;
 }
 EXPORT_SYMBOL(build_skb);
 
@@ -378,7 +392,7 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }
 
-static void skb_release_data(struct sk_buff *skb)
+static void skb_release_data(struct sk_buff *skb, int free_shinfo)
 {
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
@@ -404,6 +418,8 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
 
+		if (free_shinfo)
+			kmem_cache_free(skbuff_shinfo_cache, skb_shinfo(skb));
 		kfree(skb->head);
 	}
 }
@@ -474,7 +490,7 @@ static void skb_release_head_state(struct sk_buff *skb)
 static void skb_release_all(struct sk_buff *skb)
 {
 	skb_release_head_state(skb);
-	skb_release_data(skb);
+	skb_release_data(skb, 1);
 }
 
 /**
@@ -646,6 +662,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	C(tail);
 	C(end);
 	C(head);
+	C(shinfo);
 	C(data);
 	C(truesize);
 	atomic_set(&n->users, 1);
@@ -941,18 +958,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
-	if (fastpath &&
-	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
-		memmove(skb->head + size, skb_shinfo(skb),
-			offsetof(struct skb_shared_info,
-				 frags[skb_shinfo(skb)->nr_frags]));
+	if (fastpath && size <= ksize(skb->head)) {
 		memmove(skb->head + nhead, skb->head,
 			skb_tail_pointer(skb) - skb->head);
 		off = nhead;
 		goto adjust_others;
 	}
 
-	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+	data = kmalloc(size, gfp_mask);
 	if (!data)
 		goto nodata;
 
@@ -961,10 +974,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 */
 	memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
 
-	memcpy((struct skb_shared_info *)(data + size),
-	       skb_shinfo(skb),
-	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
-
 	if (fastpath) {
 		kfree(skb->head);
 	} else {
@@ -979,7 +988,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		if (skb_has_frag_list(skb))
 			skb_clone_fraglist(skb);
 
-		skb_release_data(skb);
+		skb_release_data(skb, 0);
 	}
 	off = (data + nhead) - skb->head;
 
@@ -2956,6 +2965,12 @@ void __init skb_init(void)
 						0,
 						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 						NULL);
+
+	skbuff_shinfo_cache = kmem_cache_create("skbuff_shinfo_cache",
+						sizeof(struct skb_shared_info),
+						0,
+						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+						NULL);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-06 11:57 ` Ian Campbell
@ 2011-12-06 13:24   ` Eric Dumazet
  2011-12-07 13:35     ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-12-06 13:24 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, Jesse Brandeburg, netdev

Le mardi 06 décembre 2011 à 11:57 +0000, Ian Campbell a écrit :
> On Wed, 2011-11-09 at 15:01 +0000, Ian Campbell wrote:
> >       * split linear data allocation and shinfo allocation into two. I
> >         suspect this will have its own performance implications? On the
> >         positive side skb_shared_info could come from its own fixed size
> >         pool/cache which might have some benefits
> 
> I played with this to see how it would look. Illustrative patch below. 
> 
> I figure that lots of small frames is the interesting workload for a
> change such as this but I don't know if iperf is necessarily the best
> benchmark for measuring that.
> Before changing things I got:
>         iperf -c qarun -m -t 60 -u -b 10000M -l 64
>         ------------------------------------------------------------
>         Client connecting to qarun, UDP port 5001
>         Sending 64 byte datagrams
>         UDP buffer size:   224 KByte (default)
>         ------------------------------------------------------------
>         [  3] local 10.80.225.63 port 45857 connected with 10.80.224.22 port 5001
>         [ ID] Interval       Transfer     Bandwidth
>         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec
>         [  3] Sent 13820376 datagrams
>         [  3] Server Report:
>         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec  0.005 ms    0/13820375 (0%)
>         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> whereas with the patch:
>         # iperf -c qarun -m -t 60 -u -b 10000M -l 64
>         ------------------------------------------------------------
>         Client connecting to qarun, UDP port 5001
>         Sending 64 byte datagrams
>         UDP buffer size:   224 KByte (default)
>         ------------------------------------------------------------
>         [  3] local 10.80.225.63 port 42504 connected with 10.80.224.22 port 5001
>         [ ID] Interval       Transfer     Bandwidth
>         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec
>         [  3] Sent 13645857 datagrams
>         [  3] Server Report:
>         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec  0.005 ms    0/13645856 (0%)
>         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> 
> With 1200 byte datagrams I get basically identical throughput.
> 
> (nb: none of the skb destructor stuff was present in either case)

Sorry, but the real problem is that if the skb producer and consumer are
not on the same CPU, each skb will now hit the SLUB slowpath three times
instead of two.

Some workloads are: one CPU fully handling IRQs from the device,
dispatching skbs to consumers on other CPUs.

Plus skb->truesize is wrong after your patch.
Not sure if cloning is correct either...

Anyway, do we _really_ need 16 frags per skb, I don't know....

This gives problems when/if skb must be linearized and we hit
PAGE_ALLOC_COSTLY_ORDER

Alternatively, we could use order-1 or order-2 pages on x86 to get
8192/16384 bytes frags. (fallback to order-0 pages in case of allocation
failures)
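
A minimal sketch of that fallback, assuming the caller picks the base gfp
flags and tracks the order it actually got (alloc_frag_page is an
illustrative name here, not an existing helper):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Try a higher-order compound page for a frag, fall back to order-0. */
static struct page *alloc_frag_page(gfp_t gfp, unsigned int *order)
{
	struct page *page;

	if (*order) {
		page = alloc_pages(gfp | __GFP_NOWARN | __GFP_NORETRY, *order);
		if (page)
			return page;
		*order = 0;	/* remember the fallback for the caller */
	}
	return alloc_pages(gfp, 0);
}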

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-06 13:24   ` Eric Dumazet
@ 2011-12-07 13:35     ` Ian Campbell
  2011-12-09 13:47       ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-07 13:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jesse Brandeburg, netdev

On Tue, 2011-12-06 at 13:24 +0000, Eric Dumazet wrote:
> Le mardi 06 décembre 2011 à 11:57 +0000, Ian Campbell a écrit :
> > On Wed, 2011-11-09 at 15:01 +0000, Ian Campbell wrote:
> > >       * split linear data allocation and shinfo allocation into two. I
> > >         suspect this will have its own performance implications? On the
> > >         positive side skb_shared_info could come from its own fixed size
> > >         pool/cache which might have some benefits
> > 
> > I played with this to see how it would look. Illustrative patch below. 
> > 
> > I figure that lots of small frames is the interesting workload for a
> > change such as this but I don't know if iperf is necessarily the best
> > benchmark for measuring that.
> > Before changing things I got:
> >         iperf -c qarun -m -t 60 -u -b 10000M -l 64
> >         ------------------------------------------------------------
> >         Client connecting to qarun, UDP port 5001
> >         Sending 64 byte datagrams
> >         UDP buffer size:   224 KByte (default)
> >         ------------------------------------------------------------
> >         [  3] local 10.80.225.63 port 45857 connected with 10.80.224.22 port 5001
> >         [ ID] Interval       Transfer     Bandwidth
> >         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec
> >         [  3] Sent 13820376 datagrams
> >         [  3] Server Report:
> >         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec  0.005 ms    0/13820375 (0%)
> >         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> > whereas with the patch:
> >         # iperf -c qarun -m -t 60 -u -b 10000M -l 64
> >         ------------------------------------------------------------
> >         Client connecting to qarun, UDP port 5001
> >         Sending 64 byte datagrams
> >         UDP buffer size:   224 KByte (default)
> >         ------------------------------------------------------------
> >         [  3] local 10.80.225.63 port 42504 connected with 10.80.224.22 port 5001
> >         [ ID] Interval       Transfer     Bandwidth
> >         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec
> >         [  3] Sent 13645857 datagrams
> >         [  3] Server Report:
> >         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec  0.005 ms    0/13645856 (0%)
> >         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> > 
> > With 1200 byte datagrams I get basically identical throughput.
> > 
> > (nb: none of the skb destructor stuff was present in either case)
> 
> Sorry, but the real problem is that if skb producer and consumer are not
> on same CPU, each skb will now hit SLUB slowpath three times instead of
> two.
> 
> Some workloads are : One cpu fully handling IRQ from device, dispatching
> skbs to consumers on other cpus.

So something like a multithreaded apache benchmark would be interesting?

> Plus skb->truesize is wrong after your patch.
> Not sure if cloning is correct either...

Me neither, the patch was just one which happened to run well enough to
run some numbers on. I'll be sure to double-check/correct that stuff if
I persist with the approach.

> Anyway, do we _really_ need 16 frags per skb, I dont know....

MAX_SKB_FRAGS is:
        /* To allow 64K frame to be packed as single skb without frag_list. Since
         * GRO uses frags we allocate at least 16 regardless of page size.
         */
        #if (65536/PAGE_SIZE + 2) < 16
        #define MAX_SKB_FRAGS 16UL
        #else
        #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
        #endif
        
So I think actually it turns out to be 18 on systems with 4k pages. If
we reduced that to be 16 that would save 48 bytes on amd64 which pulls
the shinfo size down to 440 and then

        NET_SKB_PAD (64) + 1500 + 14 + 440 = 2018

which does fit in half a page.

Using 16 still means that a 64k frame fits precisely in frags on a 4096
page size system. I don't know what the extra 2 was for (it predates
git), perhaps just to allow for some slop due to misalignment in the
first page of data?

> This gives problems when/if skb must be linearized and we hit
> PAGE_ALLOC_COSTLY_ORDER
> 
> Alternatively, we could use order-1 or order-2 pages on x86 to get
> 8192/16384 bytes frags. (fallback to order-0 pages in case of allocation
> failures)

Or fallback to separate allocation of shinfo?

Ian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-07 13:35     ` Ian Campbell
@ 2011-12-09 13:47       ` Ian Campbell
  2011-12-09 18:34         ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-09 13:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jesse Brandeburg, netdev

On Wed, 2011-12-07 at 13:35 +0000, Ian Campbell wrote:
> On Tue, 2011-12-06 at 13:24 +0000, Eric Dumazet wrote:
> > Le mardi 06 décembre 2011 à 11:57 +0000, Ian Campbell a écrit :
> > > On Wed, 2011-11-09 at 15:01 +0000, Ian Campbell wrote:
> > > >       * split linear data allocation and shinfo allocation into two. I
> > > >         suspect this will have its own performance implications? On the
> > > >         positive side skb_shared_info could come from its own fixed size
> > > >         pool/cache which might have some benefits
> > > 
> > > I played with this to see how it would look. Illustrative patch below. 
> > > 
> > > I figure that lots of small frames is the interesting workload for a
> > > change such as this but I don't know if iperf is necessarily the best
> > > benchmark for measuring that.
> > > Before changing things I got:
> > >         iperf -c qarun -m -t 60 -u -b 10000M -l 64
> > >         ------------------------------------------------------------
> > >         Client connecting to qarun, UDP port 5001
> > >         Sending 64 byte datagrams
> > >         UDP buffer size:   224 KByte (default)
> > >         ------------------------------------------------------------
> > >         [  3] local 10.80.225.63 port 45857 connected with 10.80.224.22 port 5001
> > >         [ ID] Interval       Transfer     Bandwidth
> > >         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec
> > >         [  3] Sent 13820376 datagrams
> > >         [  3] Server Report:
> > >         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec  0.005 ms    0/13820375 (0%)
> > >         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> > > whereas with the patch:
> > >         # iperf -c qarun -m -t 60 -u -b 10000M -l 64
> > >         ------------------------------------------------------------
> > >         Client connecting to qarun, UDP port 5001
> > >         Sending 64 byte datagrams
> > >         UDP buffer size:   224 KByte (default)
> > >         ------------------------------------------------------------
> > >         [  3] local 10.80.225.63 port 42504 connected with 10.80.224.22 port 5001
> > >         [ ID] Interval       Transfer     Bandwidth
> > >         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec
> > >         [  3] Sent 13645857 datagrams
> > >         [  3] Server Report:
> > >         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec  0.005 ms    0/13645856 (0%)
> > >         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> > > 
> > > With 1200 byte datagrams I get basically identical throughput.
> > > 
> > > (nb: none of the skb destructor stuff was present in either case)
> > 
> > Sorry, but the real problem is that if skb producer and consumer are not
> > on same CPU, each skb will now hit SLUB slowpath three times instead of
> > two.
> > 
> > Some workloads are : One cpu fully handling IRQ from device, dispatching
> > skbs to consumers on other cpus.
> 
> So something like a multithreaded apache benchmark would be interesting?
> 
> > Plus skb->truesize is wrong after your patch.
> > Not sure if cloning is correct either...
> 
> Me neither, the patch was just one which happened to run well enough to
> run some numbers on. I'll be sure to double-check/correct that stuff if
> I persist with the approach.
> 
> > Anyway, do we _really_ need 16 frags per skb, I dont know....
> 
> MAX_SKB_FRAGS is:
>         /* To allow 64K frame to be packed as single skb without frag_list. Since
>          * GRO uses frags we allocate at least 16 regardless of page size.
>          */
>         #if (65536/PAGE_SIZE + 2) < 16
>         #define MAX_SKB_FRAGS 16UL
>         #else
>         #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
>         #endif
>         
> So I think actually it turns out to be 18 on systems with 4k pages. If
> we reduced that to be 16 that would save 48 bytes on amd64 which pulls
> the shinfo size down to 440 and then
> 
>         NET_SKB_PAD (64) + 1500 + 14 + 440 = 2018
> 
> which does fit in half a page.
> 
> Using 16 still means that a 64k frame fits precisely in frags on a 4096
> page size system. I don't know what the extra 2 was for (it predates
> git), perhaps just to allow for some slop due to misalignment in the
> first page of data?

Tracked it down in TGLX's historic tree as:
http://git.kernel.org/?p=linux/kernel/git/tglx/history.git;a=commitdiff;h=80223d5186f73bf42a7e260c66c9cb9f7d8ec9cf

Having found the title I did a bit of searching around for discussion
but all I could find was other folks not having any idea where the + 2
comes from either...

http://kerneltrap.org/mailarchive/linux-netdev/2008/4/21/1529914 sort of
wonders if the + 2 might be including a head which crosses a page
boundary. I'm not sure it is sane/useful to account for that in
MAX_SKB_FRAGS. If we had a concept like MAX_SKB_PAGES then it would
perhaps make sense to have + 2 there, but AFAICT drivers etc are already
accounting for this appropriately by adding a further + 2 (or sometimes
+ 1) to MAX_SKB_FRAGS.

Ian.

> 
> > This gives problems when/if skb must be linearized and we hit
> > PAGE_ALLOC_COSTLY_ORDER
> > 
> > Alternatively, we could use order-1 or order-2 pages on x86 to get
> > 8192/16384 bytes frags. (fallback to order-0 pages in case of allocation
> > failures)
> 
> Or fallback to separate allocation of shinfo?
> 
> Ian.
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-09 13:47       ` Ian Campbell
@ 2011-12-09 18:34         ` David Miller
  2011-12-21 11:03           ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2011-12-09 18:34 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: eric.dumazet, jesse.brandeburg, netdev

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Fri, 9 Dec 2011 13:47:07 +0000

> If we had a concept like MAX_SKB_PAGES then it would perhaps make
> sense to have + 2 there, but AFAICT drivers etc are already
> accounting for this appropriately by adding a further + 2 (or
> sometimes + 1) to MAX_SKB_FRAGS.

Any kind of code like this, including the "+ 2" in the skbuff header,
should be coded to use some kind of macro so we can track this
dependency instead of stumbling onto it and accidentally breaking lots
of stuff if we want to change this "2" value.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-09 18:34         ` David Miller
@ 2011-12-21 11:03           ` Ian Campbell
  2011-12-21 11:08             ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-21 11:03 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, jesse.brandeburg, netdev

On Fri, 2011-12-09 at 18:34 +0000, David Miller wrote:
> From: Ian Campbell <Ian.Campbell@citrix.com>
> Date: Fri, 9 Dec 2011 13:47:07 +0000
> 
> > If we had a concept like MAX_SKB_PAGES then it would perhaps make
> > sense to have + 2 there, but AFAICT drivers etc are already
> > accounting for this appropriately by adding a further + 2 (or
> > sometimes + 1) to MAX_SKB_FRAGS.
> 
> Any kind of code like this, including the "+ 2" in the skbuff header,
> should be coded to use some kind of macro so we can track this
> dependency instead of stumbling onto it and accidently breaking lots
> of stuff if we want to change this "2" value.

Agreed.

Part of the problem is that no one seems to have any idea what this
particular + 2 means. My best hypothesis is that it accounts for the
pages used by the linear area (which potentially crosses a page
boundary).

On that basis I propose to do the following + a sweep of the tree to
determine who means which. There's quite a lot of MAX_SKB_FRAGS in the
tree but based on quick inspection many are related specifically to the
bounds of the frags list and so don't need to change.

        /* To allow 64K frame to be packed as single skb without frag_list. Since
         * GRO uses frags we allocate at least 16 regardless of page size.
         */
        #if (65536/PAGE_SIZE) < 16
        #define MAX_SKB_FRAGS 16UL
        #else
        #define MAX_SKB_FRAGS (65536/PAGE_SIZE)
        #endif
        
        /* The linear area can cross a page boundary */
        #define MAX_SKB_HEAD_PAGES 2UL
        #define MAX_SKB_PAGES (MAX_SKB_FRAGS + MAX_SKB_HEAD_PAGES)

This means that MAX_SKB_FRAGS will shrink by 2. The new symbol
MAX_SKB_PAGES will have the old value of MAX_SKB_FRAGS.
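
To illustrate who would use which, here is a hypothetical driver snippet
(the xyz_* names are made up; only the macro names follow the proposal
above):

#include <linux/skbuff.h>

/* Code bounding the frags[] array keeps using MAX_SKB_FRAGS... */
static bool xyz_can_add_frag(const struct sk_buff *skb)
{
	return skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS;
}

/* ...while code reserving DMA descriptors for a worst-case skb
 * (linear head pages plus all frags) would check against MAX_SKB_PAGES.
 */
static bool xyz_tx_has_room(unsigned int free_descriptors)
{
	return free_descriptors >= MAX_SKB_PAGES;
}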

I'm also considering getting rid of the #if since I don't think we
support pages < 4K on any arch, do we?

To aid sequencing the transition it might be best to go with
MAX_SKB_FRAGS (16 + 2) and MAX_SKB_HEAD_PAGES (0) first, then convert
everyone to use the one they really meant and finally move the 2 from
MAX_SKB_FRAGS to MAX_SKB_HEAD_PAGES. I need to think about that.

Some drivers currently use MAX_SKB_FRAGS + 1, which I think is accounting
for the linear area in cases where the hardware cannot receive one that
crosses a page boundary. I'm considering defining MIN_SKB_HEAD_PAGES to
use in these circumstances.

Ian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 11:03           ` Ian Campbell
@ 2011-12-21 11:08             ` Eric Dumazet
  2011-12-21 11:18               ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-12-21 11:08 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, jesse.brandeburg, netdev

Le mercredi 21 décembre 2011 à 11:03 +0000, Ian Campbell a écrit :
> On Fri, 2011-12-09 at 18:34 +0000, David Miller wrote:
> > From: Ian Campbell <Ian.Campbell@citrix.com>
> > Date: Fri, 9 Dec 2011 13:47:07 +0000
> > 
> > > If we had a concept like MAX_SKB_PAGES then it would perhaps make
> > > sense to have + 2 there, but AFAICT drivers etc are already
> > > accounting for this appropriately by adding a further + 2 (or
> > > sometimes + 1) to MAX_SKB_FRAGS.
> > 
> > Any kind of code like this, including the "+ 2" in the skbuff header,
> > should be coded to use some kind of macro so we can track this
> > dependency instead of stumbling onto it and accidently breaking lots
> > of stuff if we want to change this "2" value.
> 
> Agreed.
> 
> Part of the problem is that no one seems to have any idea what this
> particular + 2 means. My best hypothesis is that it accounts for the
> pages used by the linear area (which potentially crosses a page
> boundary).

I don't understand the point.

Linear data is allocated with kmalloc(), so technically speaking it's
located in a single page, but the page order can be 0, 1, 2, 3, ...
MAX_ORDER.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 11:08             ` Eric Dumazet
@ 2011-12-21 11:18               ` Ian Campbell
  2011-12-21 12:30                 ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-21 11:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jesse.brandeburg, netdev

On Wed, 2011-12-21 at 11:08 +0000, Eric Dumazet wrote:
> Le mercredi 21 décembre 2011 à 11:03 +0000, Ian Campbell a écrit :
> > On Fri, 2011-12-09 at 18:34 +0000, David Miller wrote:
> > > From: Ian Campbell <Ian.Campbell@citrix.com>
> > > Date: Fri, 9 Dec 2011 13:47:07 +0000
> > > 
> > > > If we had a concept like MAX_SKB_PAGES then it would perhaps make
> > > > sense to have + 2 there, but AFAICT drivers etc are already
> > > > accounting for this appropriately by adding a further + 2 (or
> > > > sometimes + 1) to MAX_SKB_FRAGS.
> > > 
> > > Any kind of code like this, including the "+ 2" in the skbuff header,
> > > should be coded to use some kind of macro so we can track this
> > > dependency instead of stumbling onto it and accidently breaking lots
> > > of stuff if we want to change this "2" value.
> > 
> > Agreed.
> > 
> > Part of the problem is that no one seems to have any idea what this
> > particular + 2 means. My best hypothesis is that it accounts for the
> > pages used by the linear area (which potentially crosses a page
> > boundary).
> 
> I dont understand the point.

I did say hypothesis ;-) Do you know what that + 2 is actually all
about?

If no one knows what it is for maybe we should just remove it directly
instead of what I proposed?

> linear data is allocated with kmalloc(), so technically speaking its
> located in a single page, but page order can be 0, 1, 2, 3, ...
> MAX_ORDER.

I think I must misunderstand the terminology.

An order-1 allocation is a multiple of PAGE_SIZE, isn't it, even
though the pages happen to be contiguous?

Or are you considering the possibility of kmalloc returning a super page
of some description? Isn't that to some extent transparent to the caller
who (assuming PAGE_SIZE 4096) doesn't know if kmalloc(16384) returned
4*4096 contiguous pages or 1*16384 page?

Anyway I take your underlying point that 2*PAGE_SIZE is no kind of limit
on the size of the linear region.

Ian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 11:18               ` Ian Campbell
@ 2011-12-21 12:30                 ` Eric Dumazet
  2011-12-21 13:48                   ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-12-21 12:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, jesse.brandeburg, netdev

Le mercredi 21 décembre 2011 à 11:18 +0000, Ian Campbell a écrit :

> 
> An order 1 allocation is in multiples of PAGE_SIZE, isn't it, even
> though they happen to be contiguous?

Really an order-1 allocation allocates one page, a compound one.

> 
> Or are you considering the possibility of kmalloc returning a super page
> of some description? Isn't that to some extent transparent to the caller
> who (assuming PAGE_SIZE 4096) doesn't know if kmalloc(16384) returned
> 4*4096 contiguous pages or 1*16384 page?
> 

It's a part of a _single_ page, or the total of it.

It's mandatory, or else many drivers would break their DMA handling.

> Anyway I take your underlying point that 2*PAGE_SIZE is no kind of limit
> on the size of the linear region.
> 

Sure, since net/unix/af_unix.c can happily allocate huge linear skbs, if
memory is not too fragmented of course...

(The stream unix sockets limit their skbs to SKB_MAX_ALLOC (up to 4
pages), but datagram ones have no limit)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 12:30                 ` Eric Dumazet
@ 2011-12-21 13:48                   ` Ian Campbell
  2011-12-21 14:02                     ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-21 13:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jesse.brandeburg, netdev

On Wed, 2011-12-21 at 12:30 +0000, Eric Dumazet wrote:
> Le mercredi 21 décembre 2011 à 11:18 +0000, Ian Campbell a écrit :
> 
> > 
> > An order 1 allocation is in multiples of PAGE_SIZE, isn't it, even
> > though they happen to be contiguous?
> 
> Really an order-1 allocation allocates one page, a compound one.

Oh, right, I see what you mean.

You snipped the question about the + 2, I take it you have no idea what
it is all about either?

Ian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 13:48                   ` Ian Campbell
@ 2011-12-21 14:02                     ` Eric Dumazet
  2011-12-21 19:28                       ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-12-21 14:02 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, jesse.brandeburg, netdev

Le mercredi 21 décembre 2011 à 13:48 +0000, Ian Campbell a écrit :
> On Wed, 2011-12-21 at 12:30 +0000, Eric Dumazet wrote:
> > Le mercredi 21 décembre 2011 à 11:18 +0000, Ian Campbell a écrit :
> > 
> > > 
> > > An order 1 allocation is in multiples of PAGE_SIZE, isn't it, even
> > > though they happen to be contiguous?
> > 
> > Really an order-1 allocation allocates one page, a compound one.
> 
> Oh, right, I see what you mean.
> 
> You snipped the question about the + 2, I take it you have no idea what
> it is all about either?
> 

No idea on this +2 point.

I know some hardware has limits on fragment length (for example IGB
uses the IGB_MAX_DATA_PER_TXD limit), but I don't know why we add extra
frags in the generic skb.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 14:02                     ` Eric Dumazet
@ 2011-12-21 19:28                       ` David Miller
  2011-12-22 10:33                         ` Ian Campbell
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2011-12-21 19:28 UTC (permalink / raw)
  To: eric.dumazet; +Cc: Ian.Campbell, jesse.brandeburg, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Dec 2011 15:02:18 +0100

> No idea on this +2 point.

I think I know, and I believe I instructed Alexey Kuznetsov to do
this.

When sendfile() is performed, we might start the SKB with the last few
bytes of one page, and end the SKB with the first few bytes of another
page.

In order to fit a full 64K frame into an SKB in this situation we have
to accommodate this case.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-21 19:28                       ` David Miller
@ 2011-12-22 10:33                         ` Ian Campbell
  2011-12-22 18:20                           ` David Miller
  2011-12-22 18:34                           ` Michał Mirosław
  0 siblings, 2 replies; 52+ messages in thread
From: Ian Campbell @ 2011-12-22 10:33 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, jesse.brandeburg, netdev

On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 21 Dec 2011 15:02:18 +0100
> 
> > No idea on this +2 point.
> 
> I think I know, and I believe I instructed Alexey Kuznetsov to do
> this.
> 
> When sendfile() is performed, we might start the SKB with the last few
> bytes of one page, and end the SKB with the first few bytes of another
> page.
> 
> In order to fit a full 64K frame into an SKB in this situation we have
> to accomodate this case.

Thanks David, that makes sense.

However I think you only actually need 1 extra page for that. If the
data in frag[0] starts at $offset then frag[16] will need to have
$offset bytes in it. e.g.
	4096-$offset + 4096*15 + $offset = 65536
which == 17 pages rather than 18.
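
As a quick standalone check of that arithmetic (a userspace sketch,
assuming a 4096 byte PAGE_SIZE):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Pages spanned by a buffer of 'len' bytes whose first byte sits at
 * 'offset' within its first page.
 */
static unsigned long pages_spanned(unsigned long offset, unsigned long len)
{
	return (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
	unsigned long off, worst = 0;

	for (off = 0; off < PAGE_SIZE; off++)
		if (pages_spanned(off, 65536) > worst)
			worst = pages_spanned(off, 65536);

	printf("worst case: %lu pages\n", worst);	/* prints 17 */
	return 0;
}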

The following documents the status quo but I could update to switch to +
1 instead if there are no flaws in the above logic...

Ian.

>From 0a327232d6df809c4cf294edc8d02f6f88dd5678 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@citrix.com>
Date: Thu, 22 Dec 2011 10:07:19 +0000
Subject: [PATCH] net: document reason for 2 additional pages in MAX_SKB_FRAGS

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe86488..1c894a8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -128,8 +128,12 @@ struct sk_buff_head {
 
 struct sk_buff;
 
-/* To allow 64K frame to be packed as single skb without frag_list. Since
- * GRO uses frags we allocate at least 16 regardless of page size.
+/* To allow 64K frame to be packed as single skb without frag_list we
+ * require 64K/PAGE_SIZE pages plus 2 additional pages to allow for
+ * buffers which do not start on a page boundary.
+ *
+ * Since GRO uses frags we allocate at least 16 regardless of page
+ * size.
  */
 #if (65536/PAGE_SIZE + 2) < 16
 #define MAX_SKB_FRAGS 16UL
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 10:33                         ` Ian Campbell
@ 2011-12-22 18:20                           ` David Miller
  2011-12-23  9:35                             ` Ian Campbell
       [not found]                             ` <45B8991A987A4149B40F1A061BF49097B96C9EFF05@LONPMAILBOX01.citrite.net>
  2011-12-22 18:34                           ` Michał Mirosław
  1 sibling, 2 replies; 52+ messages in thread
From: David Miller @ 2011-12-22 18:20 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: eric.dumazet, jesse.brandeburg, netdev

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Thu, 22 Dec 2011 10:33:36 +0000

> On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 21 Dec 2011 15:02:18 +0100
>> 
>> > No idea on this +2 point.
>> 
>> I think I know, and I believe I instructed Alexey Kuznetsov to do
>> this.
>> 
>> When sendfile() is performed, we might start the SKB with the last few
>> bytes of one page, and end the SKB with the first few bytes of another
>> page.
>> 
>> In order to fit a full 64K frame into an SKB in this situation we have
>> to accomodate this case.
> 
> Thanks David, that makes sense.
> 
> However I think you only actually need 1 extra page for that. If the
> data in frag[0] starts at $offset then frag[16] will need to have
> $offset bytes in it. e.g.
> 	4096-$offset + 4096*15 + $offset = 65536
> which == 17 pages rather than 18.
> 
> The following documents the status quo but I could update to switch to +
> 1 instead if there are no flaws in the above logic...

Indeed, you're right.  Please change this to 1 and document it, and we
can put that change into net-next, thanks a lot!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 10:33                         ` Ian Campbell
  2011-12-22 18:20                           ` David Miller
@ 2011-12-22 18:34                           ` Michał Mirosław
  2011-12-22 18:43                             ` David Miller
  1 sibling, 1 reply; 52+ messages in thread
From: Michał Mirosław @ 2011-12-22 18:34 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, eric.dumazet, jesse.brandeburg, netdev

2011/12/22 Ian Campbell <Ian.Campbell@citrix.com>:
> On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 21 Dec 2011 15:02:18 +0100
>> > No idea on this +2 point.
>> I think I know, and I believe I instructed Alexey Kuznetsov to do
>> this.
>>
>> When sendfile() is performed, we might start the SKB with the last few
>> bytes of one page, and end the SKB with the first few bytes of another
>> page.
>>
>> In order to fit a full 64K frame into an SKB in this situation we have
>> to accomodate this case.
> Thanks David, that makes sense.
>
> However I think you only actually need 1 extra page for that. If the
> data in frag[0] starts at $offset then frag[16] will need to have
> $offset bytes in it. e.g.
>        4096-$offset + 4096*15 + $offset = 65536
> which == 17 pages rather than 18.
>
> The following documents the status quo but I could update to switch to +
> 1 instead if there are no flaws in the above logic...

Since the max IP datagram is 64K-1, adding ethernet and possibly VLAN
headers makes the max size slightly above 64K, and then 64K/PAGE_SIZE+2
pages can appear in the worst case.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 18:34                           ` Michał Mirosław
@ 2011-12-22 18:43                             ` David Miller
  2011-12-22 19:29                               ` Michał Mirosław
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2011-12-22 18:43 UTC (permalink / raw)
  To: mirqus; +Cc: Ian.Campbell, eric.dumazet, jesse.brandeburg, netdev

From: Michał Mirosław <mirqus@gmail.com>
Date: Thu, 22 Dec 2011 19:34:21 +0100

> 2011/12/22 Ian Campbell <Ian.Campbell@citrix.com>:
>> On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>> Date: Wed, 21 Dec 2011 15:02:18 +0100
>>> > No idea on this +2 point.
>>> I think I know, and I believe I instructed Alexey Kuznetsov to do
>>> this.
>>>
>>> When sendfile() is performed, we might start the SKB with the last few
>>> bytes of one page, and end the SKB with the first few bytes of another
>>> page.
>>>
>>> In order to fit a full 64K frame into an SKB in this situation we have
>>> to accomodate this case.
>> Thanks David, that makes sense.
>>
>> However I think you only actually need 1 extra page for that. If the
>> data in frag[0] starts at $offset then frag[16] will need to have
>> $offset bytes in it. e.g.
>>        4096-$offset + 4096*15 + $offset = 65536
>> which == 17 pages rather than 18.
>>
>> The following documents the status quo but I could update to switch to +
>> 1 instead if there are no flaws in the above logic...
> 
> Since max IP datagram is 64K-1, adding ethernet and possibly VLAN
> headers makes the max size slightly above 64K and then you have
> 64K/PAGE_SIZE+2 pages appear in worst case.

Headers go into the linear area, so they are not relevant for these
calculations.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 18:43                             ` David Miller
@ 2011-12-22 19:29                               ` Michał Mirosław
  2011-12-22 19:34                                 ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Michał Mirosław @ 2011-12-22 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: Ian.Campbell, eric.dumazet, jesse.brandeburg, netdev

2011/12/22 David Miller <davem@davemloft.net>:
> From: Michał Mirosław <mirqus@gmail.com>
> Date: Thu, 22 Dec 2011 19:34:21 +0100
>
>> 2011/12/22 Ian Campbell <Ian.Campbell@citrix.com>:
>>> On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Date: Wed, 21 Dec 2011 15:02:18 +0100
>>>> > No idea on this +2 point.
>>>> I think I know, and I believe I instructed Alexey Kuznetsov to do
>>>> this.
>>>>
>>>> When sendfile() is performed, we might start the SKB with the last few
>>>> bytes of one page, and end the SKB with the first few bytes of another
>>>> page.
>>>>
>>>> In order to fit a full 64K frame into an SKB in this situation we have
>>>> to accomodate this case.
>>> Thanks David, that makes sense.
>>>
>>> However I think you only actually need 1 extra page for that. If the
>>> data in frag[0] starts at $offset then frag[16] will need to have
>>> $offset bytes in it. e.g.
>>>        4096-$offset + 4096*15 + $offset = 65536
>>> which == 17 pages rather than 18.
>>>
>>> The following documents the status quo but I could update to switch to +
>>> 1 instead if there are no flaws in the above logic...
>>
>> Since max IP datagram is 64K-1, adding ethernet and possibly VLAN
>> headers makes the max size slightly above 64K and then you have
>> 64K/PAGE_SIZE+2 pages appear in worst case.
>
> Headers go into the linear area, so they are not relevant for these
> calculations.

Does this hold for LRO'ed packets and packets sent via PF_PACKET socket?

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 19:29                               ` Michał Mirosław
@ 2011-12-22 19:34                                 ` David Miller
  2011-12-23 18:10                                   ` Michał Mirosław
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2011-12-22 19:34 UTC (permalink / raw)
  To: mirqus; +Cc: Ian.Campbell, eric.dumazet, jesse.brandeburg, netdev

From: Michał Mirosław <mirqus@gmail.com>
Date: Thu, 22 Dec 2011 20:29:30 +0100

> 2011/12/22 David Miller <davem@davemloft.net>:
>> From: Michał Mirosław <mirqus@gmail.com>
>> Date: Thu, 22 Dec 2011 19:34:21 +0100
>>
>>> 2011/12/22 Ian Campbell <Ian.Campbell@citrix.com>:
>>>> On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
>>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>>> Date: Wed, 21 Dec 2011 15:02:18 +0100
>>>>> > No idea on this +2 point.
>>>>> I think I know, and I believe I instructed Alexey Kuznetsov to do
>>>>> this.
>>>>>
>>>>> When sendfile() is performed, we might start the SKB with the last few
>>>>> bytes of one page, and end the SKB with the first few bytes of another
>>>>> page.
>>>>>
>>>>> In order to fit a full 64K frame into an SKB in this situation we have
>>>>> to accomodate this case.
>>>> Thanks David, that makes sense.
>>>>
>>>> However I think you only actually need 1 extra page for that. If the
>>>> data in frag[0] starts at $offset then frag[16] will need to have
>>>> $offset bytes in it. e.g.
>>>>        4096-$offset + 4096*15 + $offset = 65536
>>>> which == 17 pages rather than 18.
>>>>
>>>> The following documents the status quo but I could update to switch to +
>>>> 1 instead if there are no flaws in the above logic...
>>>
>>> Since max IP datagram is 64K-1, adding ethernet and possibly VLAN
>>> headers makes the max size slightly above 64K and then you have
>>> 64K/PAGE_SIZE+2 pages appear in worst case.
>>
>> Headers go into the linear area, so they are not relevant for these
>> calculations.
> 
> Does this hold for LRO'ed packets and packets sent via PF_PACKET socket?

LRO is for receive, not packets we build on transmit, and in any event
such packets have their headers pulled into the linear area before the
stack sees them.

For PF_PACKET, in the packet_snd() case it uses a linear SKB and in
the tpacket_snd() case it uses a linear SKB as well.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 18:20                           ` David Miller
@ 2011-12-23  9:35                             ` Ian Campbell
  2012-01-03  9:36                               ` David Laight
       [not found]                             ` <45B8991A987A4149B40F1A061BF49097B96C9EFF05@LONPMAILBOX01.citrite.net>
  1 sibling, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-23  9:35 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, jesse.brandeburg, netdev

On Thu, 2011-12-22 at 18:20 +0000, David Miller wrote:
> From: Ian Campbell <Ian.Campbell@citrix.com>
> Date: Thu, 22 Dec 2011 10:33:36 +0000
> 
> > On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
> >> From: Eric Dumazet <eric.dumazet@gmail.com>
> >> Date: Wed, 21 Dec 2011 15:02:18 +0100
> >> 
> >> > No idea on this +2 point.
> >> 
> >> I think I know, and I believe I instructed Alexey Kuznetsov to do
> >> this.
> >> 
> >> When sendfile() is performed, we might start the SKB with the last few
> >> bytes of one page, and end the SKB with the first few bytes of another
> >> page.
> >> 
> >> In order to fit a full 64K frame into an SKB in this situation we have
> >> to accomodate this case.
> > 
> > Thanks David, that makes sense.
> > 
> > However I think you only actually need 1 extra page for that. If the
> > data in frag[0] starts at $offset then frag[16] will need to have
> > $offset bytes in it. e.g.
> > 	4096-$offset + 4096*15 + $offset = 65536
> > which == 17 pages rather than 18.
> > 
> > The following documents the status quo but I could update to switch to +
> > 1 instead if there are no flaws in the above logic...
> 
> Indeed, you're right.  Please change this to 1 and document it, and we
> can put that change into net-next, thanks a lot!

Please see below.

This change actually makes the size increase due to adding the
destructor small enough that a maximum MTU frame + shared_info still
fits in 2048 bytes (just!). I think I can see some scope for shaving off
a few more bytes as well but that'll have to wait for next year.

8<-----------------------------------------

>From 75d3ee297af89b7f801bb0a13517f51181ef4d3c Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@citrix.com>
Date: Thu, 22 Dec 2011 10:07:19 +0000
Subject: [PATCH] net: document reason for 2 additional pages in MAX_SKB_FRAGS

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe86488..0592b3d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -128,13 +128,17 @@ struct sk_buff_head {
 
 struct sk_buff;
 
-/* To allow 64K frame to be packed as single skb without frag_list. Since
- * GRO uses frags we allocate at least 16 regardless of page size.
+/* To allow 64K frame to be packed as single skb without frag_list we
+ * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
+ * buffers which do not start on a page boundary.
+ *
+ * Since GRO uses frags we allocate at least 16 regardless of page
+ * size.
  */
-#if (65536/PAGE_SIZE + 2) < 16
+#if (65536/PAGE_SIZE + 1) < 16
 #define MAX_SKB_FRAGS 16UL
 #else
-#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
+#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
 #endif
 
 typedef struct skb_frag_struct skb_frag_t;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
       [not found]                             ` <45B8991A987A4149B40F1A061BF49097B96C9EFF05@LONPMAILBOX01.citrite.net>
@ 2011-12-23  9:39                               ` Ian Campbell
  2011-12-23 21:52                                 ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Ian Campbell @ 2011-12-23  9:39 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, jesse.brandeburg, netdev

On Fri, 2011-12-23 at 09:35 +0000, Ian Campbell wrote:
> On Thu, 2011-12-22 at 18:20 +0000, David Miller wrote:
> > From: Ian Campbell <Ian.Campbell@citrix.com>
> > Date: Thu, 22 Dec 2011 10:33:36 +0000
> > 
> > > On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
> > >> From: Eric Dumazet <eric.dumazet@gmail.com>
> > >> Date: Wed, 21 Dec 2011 15:02:18 +0100
> > >> 
> > >> > No idea on this +2 point.
> > >> 
> > >> I think I know, and I believe I instructed Alexey Kuznetsov to do
> > >> this.
> > >> 
> > >> When sendfile() is performed, we might start the SKB with the last few
> > >> bytes of one page, and end the SKB with the first few bytes of another
> > >> page.
> > >> 
> > >> In order to fit a full 64K frame into an SKB in this situation we have
> > >> to accomodate this case.
> > > 
> > > Thanks David, that makes sense.
> > > 
> > > However I think you only actually need 1 extra page for that. If the
> > > data in frag[0] starts at $offset then frag[16] will need to have
> > > $offset bytes in it. e.g.
> > > 	4096-$offset + 4096*15 + $offset = 65536
> > > which == 17 pages rather than 18.
> > > 
> > > The following documents the status quo but I could update to switch to +
> > > 1 instead if there are no flaws in the above logic...
> > 
> > Indeed, you're right.  Please change this to 1 and document it, and we
> > can put that change into net-next, thanks a lot!
> 
> Please see below.

-ENOCOFFEE. I forgot to update the changelog message :-( sorry. Another
attempt follows.

8<-----------------------------------------

>From 333b3165ad464b68c8fca87a759adea83f6ff6a6 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@citrix.com>
Date: Thu, 22 Dec 2011 10:07:19 +0000
Subject: [PATCH] net: only use a single page of slop in MAX_SKB_FRAGS

In order to accommodate a 64K buffer we need 64K/PAGE_SIZE plus one more page
in order to allow for a buffer which does not start on a page boundary.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe86488..0592b3d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -128,13 +128,17 @@ struct sk_buff_head {
 
 struct sk_buff;
 
-/* To allow 64K frame to be packed as single skb without frag_list. Since
- * GRO uses frags we allocate at least 16 regardless of page size.
+/* To allow 64K frame to be packed as single skb without frag_list we
+ * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
+ * buffers which do not start on a page boundary.
+ *
+ * Since GRO uses frags we allocate at least 16 regardless of page
+ * size.
  */
-#if (65536/PAGE_SIZE + 2) < 16
+#if (65536/PAGE_SIZE + 1) < 16
 #define MAX_SKB_FRAGS 16UL
 #else
-#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
+#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
 #endif
 
 typedef struct skb_frag_struct skb_frag_t;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-22 19:34                                 ` David Miller
@ 2011-12-23 18:10                                   ` Michał Mirosław
  0 siblings, 0 replies; 52+ messages in thread
From: Michał Mirosław @ 2011-12-23 18:10 UTC (permalink / raw)
  To: David Miller; +Cc: Ian.Campbell, eric.dumazet, jesse.brandeburg, netdev

2011/12/22 David Miller <davem@davemloft.net>:
> From: Michał Mirosław <mirqus@gmail.com>
> Date: Thu, 22 Dec 2011 20:29:30 +0100
>
>> 2011/12/22 David Miller <davem@davemloft.net>:
>>> From: Michał Mirosław <mirqus@gmail.com>
>>> Date: Thu, 22 Dec 2011 19:34:21 +0100
>>>
>>>> 2011/12/22 Ian Campbell <Ian.Campbell@citrix.com>:
>>>>> On Wed, 2011-12-21 at 19:28 +0000, David Miller wrote:
>>>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>>>> Date: Wed, 21 Dec 2011 15:02:18 +0100
>>>>>> > No idea on this +2 point.
>>>>>> I think I know, and I believe I instructed Alexey Kuznetsov to do
>>>>>> this.
>>>>>>
>>>>>> When sendfile() is performed, we might start the SKB with the last few
>>>>>> bytes of one page, and end the SKB with the first few bytes of another
>>>>>> page.
>>>>>>
>>>>>> In order to fit a full 64K frame into an SKB in this situation we have
>>>>>> to accomodate this case.
>>>>> Thanks David, that makes sense.
>>>>>
>>>>> However I think you only actually need 1 extra page for that. If the
>>>>> data in frag[0] starts at $offset then frag[16] will need to have
>>>>> $offset bytes in it. e.g.
>>>>>        4096-$offset + 4096*15 + $offset = 65536
>>>>> which == 17 pages rather than 18.
>>>>>
>>>>> The following documents the status quo but I could update to switch to +
>>>>> 1 instead if there are no flaws in the above logic...
>>>>
>>>> Since max IP datagram is 64K-1, adding ethernet and possibly VLAN
>>>> headers makes the max size slightly above 64K and then you have
>>>> 64K/PAGE_SIZE+2 pages appear in worst case.
>>>
>>> Headers go into the linear area, so they are not relevant for these
>>> calculations.
>>
>> Does this hold for LRO'ed packets and packets sent via PF_PACKET socket?
>
> LRO is for receive, not packets we build on transmit, and in any event
> have their headers also pulled into the linear area before the stack
> sees it.

Yes, but the drivers seem to first fill the data into frags[] and only
then move the packet's header (I browsed myri10ge).

> For PF_PACKET, in the packet_snd() case it uses a linear SKB and in
> the tpacket_snd() case it uses a linear SKB as well.

I saw skb_fill_page_desc() in af_packet.c (called by
tpacket_fill_skb() which is called by tpacket_snd()), that's why I'm
asking.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] skb paged fragment destructors
  2011-12-23  9:39                               ` Ian Campbell
@ 2011-12-23 21:52                                 ` David Miller
  0 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2011-12-23 21:52 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: eric.dumazet, jesse.brandeburg, netdev

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Fri, 23 Dec 2011 09:39:14 +0000

> Subject: [PATCH] net: only use a single page of slop in MAX_SKB_FRAGS
> 
> In order to accommodate a 64K buffer we need 64K/PAGE_SIZE plus one more page
> in order to allow for a buffer which does not start on a page boundary.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

Applied.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH 0/4] skb paged fragment destructors
  2011-12-23  9:35                             ` Ian Campbell
@ 2012-01-03  9:36                               ` David Laight
  0 siblings, 0 replies; 52+ messages in thread
From: David Laight @ 2012-01-03  9:36 UTC (permalink / raw)
  To: Ian Campbell, David Miller; +Cc: eric.dumazet, jesse.brandeburg, netdev

 
> -#if (65536/PAGE_SIZE + 2) < 16
> +#if (65536/PAGE_SIZE + 1) < 16
>  #define MAX_SKB_FRAGS 16UL
>  #else
> -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
> +#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
>  #endif

Wouldn't it be better to only have the expression once?
Either using a MAX() define or maybe:

#define MAX_SKB_FRAGS (65536UL/PAGE_SIZE + 1)
#if MAX_SKB_FRAGS < 16UL
#undef MAX_SKB_FRAGS
#define MAX_SKB_FRAGS 16UL
#endif

Although that replicates the 16 instead :-(

	David

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2012-01-03  9:38 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-09 15:01 [PATCH 0/4] skb paged fragment destructors Ian Campbell
2011-11-09 15:02 ` [PATCH 1/4] net: add support for per-paged-fragment destructors Ian Campbell
2011-11-09 15:33   ` Michał Mirosław
2011-11-09 16:25     ` Ian Campbell
2011-11-09 17:24       ` Michał Mirosław
2011-11-09 17:28         ` Ian Campbell
2011-11-09 15:02 ` [PATCH 2/4] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
     [not found] ` <1320850895.955.172.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2011-11-09 15:02   ` [PATCH 3/4] net: add paged frag destructor support to kernel_sendpage Ian Campbell
2011-11-09 15:02     ` Ian Campbell
2011-11-09 18:02     ` Michał Mirosław
2011-11-09 18:02       ` [Ocfs2-devel] " Michał Mirosław
2011-11-09 18:02       ` Michał Mirosław
2011-11-09 15:02   ` [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell
2011-11-09 15:02     ` Ian Campbell
     [not found]     ` <1320850927-30240-4-git-send-email-ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
2011-11-11 12:38       ` Michael S. Tsirkin
2011-11-11 12:38         ` Michael S. Tsirkin
     [not found]         ` <20111111123824.GA23902-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2011-11-11 13:20           ` Ian Campbell
2011-11-11 13:20             ` Ian Campbell
     [not found]             ` <1321017627.955.254.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2011-11-11 20:00               ` J. Bruce Fields
2011-11-11 20:00                 ` J. Bruce Fields
2011-11-13 10:17             ` Michael S. Tsirkin
     [not found]               ` <20111113101713.GB15322-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2011-11-14 13:07                 ` Ian Campbell
2011-11-14 13:07                   ` Ian Campbell
2011-11-14 13:25                   ` David Laight
2011-11-09 17:49 ` [PATCH 0/4] skb paged fragment destructors Eric Dumazet
2011-11-10 10:39   ` Ian Campbell
2011-11-17 14:45   ` Ian Campbell
2011-11-17 14:51     ` Eric Dumazet
2011-11-17 20:22     ` Michał Mirosław
2011-12-06 11:57 ` Ian Campbell
2011-12-06 13:24   ` Eric Dumazet
2011-12-07 13:35     ` Ian Campbell
2011-12-09 13:47       ` Ian Campbell
2011-12-09 18:34         ` David Miller
2011-12-21 11:03           ` Ian Campbell
2011-12-21 11:08             ` Eric Dumazet
2011-12-21 11:18               ` Ian Campbell
2011-12-21 12:30                 ` Eric Dumazet
2011-12-21 13:48                   ` Ian Campbell
2011-12-21 14:02                     ` Eric Dumazet
2011-12-21 19:28                       ` David Miller
2011-12-22 10:33                         ` Ian Campbell
2011-12-22 18:20                           ` David Miller
2011-12-23  9:35                             ` Ian Campbell
2012-01-03  9:36                               ` David Laight
     [not found]                             ` <45B8991A987A4149B40F1A061BF49097B96C9EFF05@LONPMAILBOX01.citrite.net>
2011-12-23  9:39                               ` Ian Campbell
2011-12-23 21:52                                 ` David Miller
2011-12-22 18:34                           ` Michał Mirosław
2011-12-22 18:43                             ` David Miller
2011-12-22 19:29                               ` Michał Mirosław
2011-12-22 19:34                                 ` David Miller
2011-12-23 18:10                                   ` Michał Mirosław
