linux-nfs.vger.kernel.org archive mirror
* [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility
@ 2011-07-15 11:06 Ian Campbell
  2011-07-15 11:07 ` [PATCH 01/10] mm: Make some struct page's const Ian Campbell
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:06 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs

Hi,

The following is my attempt to allow entities which inject pages into
the networking stack to receive a notification when the stack has truly
finished with those pages (i.e. after any retransmissions, clones,
pull-ups etc.) rather than merely when the original skb is freed. It
implements something broadly along the lines of what was described in
[0].

The series is a proof-of-concept, but I have used it to implement a fix
for the NFS issue which I described in [1], by delaying completion of
the write() until the pages are no longer referenced by the network
stack (references which can linger due to retransmissions or cloning).
The fix covers O_DIRECT writes only; I presume non-O_DIRECT writes would
benefit from the same treatment. I expect that other block and
filesystem users of the network subsystem (e.g. iSCSI) would also
benefit from this functionality, since they suffer from the same class
of issue.

Although I've not yet rebased onto a tree containing it (this series is
based on 3.0-rc5), I also expect it would be possible to remove the need
to copy on clone which was recently added to support the
SKBTX_DEV_ZEROCOPY work by Shirley Ma. I also expect this functionality
to be useful in my attempts to add foreign page mapping to Xen's netback
(per [2]).

Lastly, I think the AF_PACKET mmap'd TX ring completion could also
benefit: although I wasn't able to cause an actual failure in that case,
it seems that cloning of skbs could cause pages which are still
referenced by the stack to be released back to userspace.

In order to do this I have introduced an API for manipulating an SKB's
paged fragments (which unfortunately necessitated changing each driver),
including explicit fragment ref and unref operations to replace direct
use of get/put_page. Using those I was then able to add an optional
extra layer of reference counting to the paged fragments, which the
creator of a fragment can use to receive a callback at the point the
page would normally be freed.

What is the general feeling regarding this approach?

The series has been built allmodconfig on x86_64 so I have likely missed
some arch-specific drivers etc. I'll take care of that in future
postings, as well as addressing the issues mentioned in some of the
commit messages.

Ian.

[0] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[1] http://marc.info/?l=linux-nfs&m=122424132729720&w=2
[2] http://marc.info/?l=linux-netdev&m=130893020922848&w=2




^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 01/10] mm: Make some struct page's const.
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 02/10] mm: use const struct page for r/o page-flag accessor methods Ian Campbell
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/mm.h |   10 +++++-----
 mm/sparse.c        |    2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..550ec8f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -636,7 +636,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
-static inline enum zone_type page_zonenum(struct page *page)
+static inline enum zone_type page_zonenum(const struct page *page)
 {
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }
@@ -664,15 +664,15 @@ static inline int zone_to_nid(struct zone *zone)
 }
 
 #ifdef NODE_NOT_IN_PAGE_FLAGS
-extern int page_to_nid(struct page *page);
+extern int page_to_nid(const struct page *page);
 #else
-static inline int page_to_nid(struct page *page)
+static inline int page_to_nid(const struct page *page)
 {
 	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
 }
 #endif
 
-static inline struct zone *page_zone(struct page *page)
+static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
 }
@@ -717,7 +717,7 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
  */
 #include <linux/vmstat.h>
 
-static __always_inline void *lowmem_page_address(struct page *page)
+static __always_inline void *lowmem_page_address(const struct page *page)
 {
 	return __va(PFN_PHYS(page_to_pfn(page)));
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index aa64b12..858e1df 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -40,7 +40,7 @@ static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
 static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
 #endif
 
-int page_to_nid(struct page *page)
+int page_to_nid(const struct page *page)
 {
 	return section_to_node_table[page_to_section(page)];
 }
-- 
1.7.2.5



* [PATCH 02/10] mm: use const struct page for r/o page-flag accessor methods
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
  2011-07-15 11:07 ` [PATCH 01/10] mm: Make some struct page's const Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 03/10] net: add APIs for manipulating skb page fragments Ian Campbell
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/page-flags.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6081493..7d632cc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,7 +135,7 @@ enum pageflags {
  * Macros to create function definitions for page flags
  */
 #define TESTPAGEFLAG(uname, lname)					\
-static inline int Page##uname(struct page *page) 			\
+static inline int Page##uname(const struct page *page) 			\
 			{ return test_bit(PG_##lname, &page->flags); }
 
 #define SETPAGEFLAG(uname, lname)					\
@@ -173,7 +173,7 @@ static inline int __TestClearPage##uname(struct page *page)		\
 	__SETPAGEFLAG(uname, lname)  __CLEARPAGEFLAG(uname, lname)
 
 #define PAGEFLAG_FALSE(uname) 						\
-static inline int Page##uname(struct page *page) 			\
+static inline int Page##uname(const struct page *page) 			\
 			{ return 0; }
 
 #define TESTSCFLAG(uname, lname)					\
-- 
1.7.2.5



* [PATCH 03/10] net: add APIs for manipulating skb page fragments.
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
  2011-07-15 11:07 ` [PATCH 01/10] mm: Make some struct page's const Ian Campbell
  2011-07-15 11:07 ` [PATCH 02/10] mm: use const struct page for r/o page-flag accessor methods Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 22:34   ` Michał Mirosław
  2011-07-15 11:07 ` [PATCH 04/10] net: convert core to skb paged frag APIs Ian Campbell
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

The primary aim is to add skb_frag_(ref|unref) in order to remove the
use of bare get/put_page on SKB page fragments and to isolate users from
subsequent changes to the skb_frag_t data structure.

The API also includes an accessor for the struct page itself. The default
variant of this returns a *const* struct page in an attempt to catch bare uses
of get/put_page (which take a non-const struct page).

Also included are helper APIs for passing a paged fragment to kmap and
(pci|dma)_map_page since I was seeing the same pattern a lot.
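
The shape of the ref/unref API can be sketched in self-contained
userspace C (toy_page and friends are invented stand-ins; the kernel
code operates on struct page via get_page()/put_page()):

```c
#include <assert.h>

/* Invented stand-ins for struct page and skb_frag_t. */
struct toy_page { int refcount; };
struct toy_frag {
	struct toy_page *page;
	int page_offset;
	int size;
};

/* Default accessor returns const so that the result cannot be
 * passed straight to a bare get/put-style call taking a mutable
 * pointer -- mirroring skb_frag_page() vs __skb_frag_page(). */
static const struct toy_page *toy_frag_page(const struct toy_frag *frag)
{
	return frag->page;
}

static void toy_frag_ref(struct toy_frag *frag)
{
	frag->page->refcount++;	/* kernel: get_page() */
}

static void toy_frag_unref(struct toy_frag *frag)
{
	frag->page->refcount--;	/* kernel: put_page() */
}
```

The const default means any code that still wants a mutable page has
to call the double-underscore variant explicitly, making remaining
bare uses easy to grep for.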

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |  225 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 223 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c0a4f3a..c061257 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -29,6 +29,8 @@
 #include <linux/rcupdate.h>
 #include <linux/dmaengine.h>
 #include <linux/hrtimer.h>
+#include <linux/highmem.h>
+#include <linux/pci.h>
 
 /* Don't change this without changing skb_csum_unnecessary! */
 #define CHECKSUM_NONE 0
@@ -1109,14 +1111,47 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 	return len + skb_headlen(skb);
 }
 
-static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
-				      struct page *page, int off, int size)
+/**
+ * __skb_fill_page_desc - initialise a paged fragment in an skb
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @page: the page to use for this fragment
+ * @off: the offset to the data within @page
+ * @size: the length of the data
+ *
+ * Initialises the @i'th fragment of @skb to point to @size bytes at
+ * offset @off within @page.
+ *
+ * Does not take any additional reference on the fragment.
+ */
+static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
+					struct page *page, int off, int size)
 {
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page		  = page;
 	frag->page_offset	  = off;
 	frag->size		  = size;
+}
+
+/**
+ * skb_fill_page_desc - initialise a paged fragment in an skb
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @page: the page to use for this fragment
+ * @off: the offset to the data within @page
+ * @size: the length of the data
+ *
+ * As per __skb_fill_page_desc() -- initialises the @i'th fragment of
+ * @skb to point to @size bytes at offset @off within @page. In
+ * addition updates @skb such that @i is the last fragment.
+ *
+ * Does not take any additional reference on the fragment.
+ */
+static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
+				      struct page *page, int off, int size)
+{
+	__skb_fill_page_desc(skb, i, page, off, size);
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
@@ -1605,6 +1640,192 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
 }
 
 /**
+ * __skb_frag_page - retrieve the page referred to by a paged fragment
+ * @frag: the paged fragment
+ *
+ * Returns the &struct page associated with @frag. Where possible you
+ * should use skb_frag_page() which returns a const &struct page.
+ */
+static inline struct page *__skb_frag_page(const skb_frag_t *frag)
+{
+	return frag->page;
+}
+
+/**
+ * skb_frag_page - retrieve the page referred to by a paged fragment
+ * @frag: the paged fragment
+ *
+ * Returns the &struct page associated with @frag as a const.
+ */
+static inline const struct page *skb_frag_page(const skb_frag_t *frag)
+{
+	return frag->page;
+}
+
+/**
+ * __skb_frag_ref - take an additional reference on a paged fragment.
+ * @frag: the paged fragment
+ *
+ * Takes an additional reference on the paged fragment @frag.
+ */
+static inline void __skb_frag_ref(skb_frag_t *frag)
+{
+	get_page(__skb_frag_page(frag));
+}
+
+/**
+ * skb_frag_ref - take an additional reference on a paged fragment of an skb.
+ * @skb: the buffer
+ * @f: the fragment offset.
+ *
+ * Takes an additional reference on the @f'th paged fragment of @skb.
+ */
+static inline void skb_frag_ref(struct sk_buff *skb, int f)
+{
+	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
+}
+
+/**
+ * __skb_frag_unref - release a reference on a paged fragment.
+ * @frag: the paged fragment
+ *
+ * Releases a reference on the paged fragment @frag.
+ */
+static inline void __skb_frag_unref(skb_frag_t *frag)
+{
+	put_page(__skb_frag_page(frag));
+}
+
+/**
+ * skb_frag_unref - release a reference on a paged fragment of an skb.
+ * @skb: the buffer
+ * @f: the fragment offset
+ *
+ * Releases a reference on the @f'th paged fragment of @skb.
+ */
+static inline void skb_frag_unref(struct sk_buff *skb, int f)
+{
+	__skb_frag_unref(&skb_shinfo(skb)->frags[f]);
+}
+
+/**
+ * skb_frag_address - gets the address of the data contained in a paged fragment
+ * @frag: the paged fragment buffer
+ *
+ * Returns the address of the data within @frag. The page must already
+ * be mapped.
+ */
+static inline void *skb_frag_address(const skb_frag_t *frag)
+{
+	return page_address(skb_frag_page(frag)) + frag->page_offset;
+}
+
+/**
+ * skb_frag_address_safe - gets the address of the data contained in a paged fragment
+ * @frag: the paged fragment buffer
+ *
+ * Returns the address of the data within @frag. Checks that the page
+ * is mapped and returns %NULL otherwise.
+ */
+static inline void *skb_frag_address_safe(const skb_frag_t *frag)
+{
+	void *ptr = page_address(skb_frag_page(frag));
+	if (unlikely(!ptr))
+		return NULL;
+
+	return ptr + frag->page_offset;
+}
+
+/**
+ * __skb_frag_set_page - sets the page contained in a paged fragment
+ * @frag: the paged fragment
+ * @page: the page to set
+ *
+ * Sets the fragment @frag to contain @page.
+ */
+static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page)
+{
+	frag->page = page;
+	__skb_frag_ref(frag);
+}
+
+/**
+ * skb_frag_set_page - sets the page contained in a paged fragment of an skb
+ * @skb: the buffer
+ * @f: the fragment offset
+ * @page: the page to set
+ *
+ * Sets the @f'th fragment of @skb to contain @page.
+ */
+static inline void skb_frag_set_page(struct sk_buff *skb, int f,
+				     struct page *page)
+{
+	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
+}
+
+/**
+ * skb_frag_kmap - kmaps a paged fragment
+ * @frag: the paged fragment
+ *
+ * kmap()s the paged fragment @frag and returns the virtual address.
+ */
+static inline void *skb_frag_kmap(skb_frag_t *frag)
+{
+	return kmap(__skb_frag_page(frag));
+}
+
+/**
+ * skb_frag_kunmap - kunmaps a paged fragment
+ * @frag: the paged fragment
+ *
+ * kunmap()s the paged fragment @frag.
+ */
+static inline void skb_frag_kunmap(skb_frag_t *frag)
+{
+	kunmap(__skb_frag_page(frag));
+}
+
+/**
+ * skb_frag_pci_map - maps a paged fragment to a PCI device
+ * @hwdev: the PCI device to map the fragment to
+ * @frag: the paged fragment to map
+ * @offset: the offset within the fragment (starting at the fragment's own offset)
+ * @size: the number of bytes to map
+ * @direction: the direction of the mapping (%PCI_DMA_*)
+ *
+ * Maps the page associated with @frag to the PCI device @hwdev.
+ */
+static inline dma_addr_t skb_frag_pci_map(struct pci_dev *hwdev,
+					  const skb_frag_t *frag,
+					  unsigned long offset,
+					  size_t size,
+					  int direction)
+
+{
+	return pci_map_page(hwdev, __skb_frag_page(frag),
+			    frag->page_offset + offset, size, direction);
+}
+
+/**
+ * skb_frag_dma_map - maps a paged fragment via the DMA API
+ * @dev: the device to map the fragment to
+ * @frag: the paged fragment to map
+ * @offset: the offset within the fragment (starting at the fragment's own offset)
+ * @size: the number of bytes to map
+ * @dir: the direction of the mapping (%DMA_*)
+ *
+ * Maps the page associated with @frag to @dev.
+ */
+static inline dma_addr_t skb_frag_dma_map(struct device *dev,
+					  const skb_frag_t *frag,
+					  size_t offset, size_t size,
+					  enum dma_data_direction dir)
+{
+	return dma_map_page(dev, __skb_frag_page(frag),
+			    frag->page_offset + offset, size, dir);
+}
+
+/**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
  *	@len: length up to which to write
-- 
1.7.2.5



* [PATCH 04/10] net: convert core to skb paged frag APIs
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (2 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 03/10] net: add APIs for manipulating skb page fragments Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 05/10] net: convert protocols to SKB " Ian Campbell
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |    4 ++--
 net/core/datagram.c    |   20 ++++++++------------
 net/core/dev.c         |    7 +++----
 net/core/kmap_skb.h    |    2 +-
 net/core/pktgen.c      |    3 +--
 net/core/skbuff.c      |   31 +++++++++++++++++--------------
 net/core/sock.c        |   12 +++++-------
 net/core/user_dma.c    |    2 +-
 8 files changed, 38 insertions(+), 43 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c061257..982c6a3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1927,12 +1927,12 @@ static inline int skb_add_data(struct sk_buff *skb,
 }
 
 static inline int skb_can_coalesce(struct sk_buff *skb, int i,
-				   struct page *page, int off)
+				   const struct page *page, int off)
 {
 	if (i) {
 		struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
-		return page == frag->page &&
+		return page == skb_frag_page(frag) &&
 		       off == frag->page_offset + frag->size;
 	}
 	return 0;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 18ac112..f0dcaa2 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -332,14 +332,13 @@ int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
 			int err;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = frag->page;
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			err = memcpy_toiovec(to, vaddr + frag->page_offset +
 					     offset - start, copy);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 			if (!(len -= copy))
@@ -418,14 +417,13 @@ int skb_copy_datagram_const_iovec(const struct sk_buff *skb, int offset,
 			int err;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = frag->page;
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			err = memcpy_toiovecend(to, vaddr + frag->page_offset +
 						offset - start, to_offset, copy);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 			if (!(len -= copy))
@@ -508,15 +506,14 @@ int skb_copy_datagram_from_iovec(struct sk_buff *skb, int offset,
 			int err;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = frag->page;
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			err = memcpy_fromiovecend(vaddr + frag->page_offset +
 						  offset - start,
 						  from, from_offset, copy);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 
@@ -594,16 +591,15 @@ static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
 			int err = 0;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = frag->page;
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			csum2 = csum_and_copy_to_user(vaddr +
 							frag->page_offset +
 							offset - start,
 						      to, copy, 0, &err);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 			*csump = csum_block_add(*csump, csum2, pos);
diff --git a/net/core/dev.c b/net/core/dev.c
index 9c58c1e..9ab39c0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3414,7 +3414,7 @@ pull:
 		skb_shinfo(skb)->frags[0].size -= grow;
 
 		if (unlikely(!skb_shinfo(skb)->frags[0].size)) {
-			put_page(skb_shinfo(skb)->frags[0].page);
+			skb_frag_unref(skb, 0);
 			memmove(skb_shinfo(skb)->frags,
 				skb_shinfo(skb)->frags + 1,
 				--skb_shinfo(skb)->nr_frags * sizeof(skb_frag_t));
@@ -3478,10 +3478,9 @@ void skb_gro_reset_offset(struct sk_buff *skb)
 	NAPI_GRO_CB(skb)->frag0_len = 0;
 
 	if (skb->mac_header == skb->tail &&
-	    !PageHighMem(skb_shinfo(skb)->frags[0].page)) {
+	    !PageHighMem(skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
 		NAPI_GRO_CB(skb)->frag0 =
-			page_address(skb_shinfo(skb)->frags[0].page) +
-			skb_shinfo(skb)->frags[0].page_offset;
+			skb_frag_address(&skb_shinfo(skb)->frags[0]);
 		NAPI_GRO_CB(skb)->frag0_len = skb_shinfo(skb)->frags[0].size;
 	}
 }
diff --git a/net/core/kmap_skb.h b/net/core/kmap_skb.h
index 283c2b9..b1e9711 100644
--- a/net/core/kmap_skb.h
+++ b/net/core/kmap_skb.h
@@ -7,7 +7,7 @@ static inline void *kmap_skb_frag(const skb_frag_t *frag)
 
 	local_bh_disable();
 #endif
-	return kmap_atomic(frag->page, KM_SKB_DATA_SOFTIRQ);
+	return kmap_atomic(__skb_frag_page(frag), KM_SKB_DATA_SOFTIRQ);
 }
 
 static inline void kunmap_skb_frag(void *vaddr)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index f76079c..989b2b6 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2600,8 +2600,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
 				if (!pkt_dev->page)
 					break;
 			}
-			skb_shinfo(skb)->frags[i].page = pkt_dev->page;
-			get_page(pkt_dev->page);
+			skb_frag_set_page(skb, i, pkt_dev->page);
 			skb_shinfo(skb)->frags[i].page_offset = 0;
 			/*last fragment, fill rest of data*/
 			if (i == (frags - 1))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..2133600 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -326,7 +326,7 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
 			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+				skb_frag_unref(skb, i);
 		}
 
 		if (skb_has_frag_list(skb))
@@ -733,7 +733,7 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			skb_frag_ref(skb, i);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -820,7 +820,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		kfree(skb->head);
 	} else {
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-			get_page(skb_shinfo(skb)->frags[i].page);
+			skb_frag_ref(skb, i);
 
 		if (skb_has_frag_list(skb))
 			skb_clone_fraglist(skb);
@@ -1098,7 +1098,7 @@ drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_frag_unref(skb, i);
 
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
@@ -1267,7 +1267,7 @@ pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_frag_unref(skb, i);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1512,7 +1512,9 @@ static int __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) {
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
-		if (__splice_segment(f->page, f->page_offset, f->size,
+		/* XXX */
+		if (__splice_segment(__skb_frag_page(f),
+				     f->page_offset, f->size,
 				     offset, len, skb, spd, 0, sk, pipe))
 			return 1;
 	}
@@ -2057,7 +2059,7 @@ static inline void skb_split_no_header(struct sk_buff *skb,
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				skb_frag_ref(skb, i);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -2132,7 +2134,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 * commit all, so that we don't have to undo partial changes
 	 */
 	if (!to ||
-	    !skb_can_coalesce(tgt, to, fragfrom->page, fragfrom->page_offset)) {
+	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+			      fragfrom->page_offset)) {
 		merge = -1;
 	} else {
 		merge = to - 1;
@@ -2179,7 +2182,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 			to++;
 
 		} else {
-			get_page(fragfrom->page);
+			__skb_frag_ref(fragfrom);
 			fragto->page = fragfrom->page;
 			fragto->page_offset = fragfrom->page_offset;
 			fragto->size = todo;
@@ -2201,7 +2204,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 		fragto = &skb_shinfo(tgt)->frags[merge];
 
 		fragto->size += fragfrom->size;
-		put_page(fragfrom->page);
+		__skb_frag_unref(fragfrom);
 	}
 
 	/* Reposition in the original skb */
@@ -2446,8 +2449,7 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 		left = PAGE_SIZE - frag->page_offset;
 		copy = (length > left)? left : length;
 
-		ret = getfrag(from, (page_address(frag->page) +
-			    frag->page_offset + frag->size),
+		ret = getfrag(from, skb_frag_address(frag) + frag->size,
 			    offset, copy, 0, skb);
 		if (ret < 0)
 			return -EFAULT;
@@ -2599,7 +2601,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 features)
 
 		while (pos < offset + len && i < nfrags) {
 			*frag = skb_shinfo(skb)->frags[i];
-			get_page(frag->page);
+			__skb_frag_ref(frag);
 			size = frag->size;
 
 			if (pos < offset) {
@@ -2822,7 +2824,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 
 			if (copy > len)
 				copy = len;
-			sg_set_page(&sg[elt], frag->page, copy,
+			/* XXX */
+			sg_set_page(&sg[elt], __skb_frag_page(frag), copy,
 					frag->page_offset+offset-start);
 			elt++;
 			if (!(len -= copy))
diff --git a/net/core/sock.c b/net/core/sock.c
index 6e81978..0fb2160 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1530,7 +1530,6 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 				skb_shinfo(skb)->nr_frags = npages;
 				for (i = 0; i < npages; i++) {
 					struct page *page;
-					skb_frag_t *frag;
 
 					page = alloc_pages(sk->sk_allocation, 0);
 					if (!page) {
@@ -1540,12 +1539,11 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 						goto failure;
 					}
 
-					frag = &skb_shinfo(skb)->frags[i];
-					frag->page = page;
-					frag->page_offset = 0;
-					frag->size = (data_len >= PAGE_SIZE ?
-						      PAGE_SIZE :
-						      data_len);
+					__skb_fill_page_desc(skb, i,
+							page, 0,
+							(data_len >= PAGE_SIZE ?
+							 PAGE_SIZE :
+							 data_len));
 					data_len -= PAGE_SIZE;
 				}
 
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
index 25d717e..d22ec3e 100644
--- a/net/core/user_dma.c
+++ b/net/core/user_dma.c
@@ -78,7 +78,7 @@ int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
 		copy = end - offset;
 		if (copy > 0) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = frag->page;
+			struct page *page = __skb_frag_page(frag); /* XXX */
 
 			if (copy > len)
 				copy = len;
-- 
1.7.2.5



* [PATCH 05/10] net: convert protocols to SKB frag APIs
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (3 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 04/10] net: convert core to skb paged frag APIs Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 06/10] net: convert drivers to paged frag API Ian Campbell
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

NB: should likely be split into per-protocol patches.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 net/ipv4/inet_lro.c    |    2 +-
 net/ipv4/ip_output.c   |    8 +++++---
 net/ipv4/tcp.c         |    3 ++-
 net/ipv4/tcp_output.c  |    2 +-
 net/ipv6/ip6_output.c  |    8 +++++---
 net/xfrm/xfrm_ipcomp.c |   11 +++++++----
 6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index 85a0f75..63f3def 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -449,7 +449,7 @@ static struct sk_buff *__lro_proc_segment(struct net_lro_mgr *lro_mgr,
 	if (!lro_mgr->get_frag_header ||
 	    lro_mgr->get_frag_header(frags, (void *)&mac_hdr, (void *)&iph,
 				     (void *)&tcph, &flags, priv)) {
-		mac_hdr = page_address(frags->page) + frags->page_offset;
+		mac_hdr = skb_frag_address(frags);
 		goto out1;
 	}
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a8024ea..29941ec 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -985,14 +985,14 @@ alloc_new_skb:
 			if (page && (left = PAGE_SIZE - off) > 0) {
 				if (copy >= left)
 					copy = left;
-				if (page != frag->page) {
+				if (page != skb_frag_page(frag)) {
 					if (i == MAX_SKB_FRAGS) {
 						err = -EMSGSIZE;
 						goto error;
 					}
-					get_page(page);
 					skb_fill_page_desc(skb, i, page, off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
+					__skb_frag_ref(frag);
 				}
 			} else if (i < MAX_SKB_FRAGS) {
 				if (copy > PAGE_SIZE)
@@ -1005,13 +1005,15 @@ alloc_new_skb:
 				cork->page = page;
 				cork->off = 0;
 
+				/* XXX no ref ? */
 				skb_fill_page_desc(skb, i, page, 0, 0);
 				frag = &skb_shinfo(skb)->frags[i];
 			} else {
 				err = -EMSGSIZE;
 				goto error;
 			}
-			if (getfrag(from, page_address(frag->page)+frag->page_offset+frag->size, offset, copy, skb->len, skb) < 0) {
+			if (getfrag(from, skb_frag_address(frag)+frag->size,
+				    offset, copy, skb->len, skb) < 0) {
 				err = -EFAULT;
 				goto error;
 			}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 054a59d..3a3703c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3035,7 +3035,8 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 
 	for (i = 0; i < shi->nr_frags; ++i) {
 		const struct skb_frag_struct *f = &shi->frags[i];
-		sg_set_page(&sg, f->page, f->size, f->page_offset);
+		struct page *page = __skb_frag_page(f); /* XXX */
+		sg_set_page(&sg, page, f->size, f->page_offset);
 		if (crypto_hash_update(desc, &sg, f->size))
 			return 1;
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..0377c06 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1095,7 +1095,7 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_frag_unref(skb, i);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9d4b165..fdd4f61 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1441,14 +1441,14 @@ alloc_new_skb:
 			if (page && (left = PAGE_SIZE - off) > 0) {
 				if (copy >= left)
 					copy = left;
-				if (page != frag->page) {
+				if (page != skb_frag_page(frag)) {
 					if (i == MAX_SKB_FRAGS) {
 						err = -EMSGSIZE;
 						goto error;
 					}
-					get_page(page);
 					skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
+					__skb_frag_ref(frag);
 				}
 			} else if(i < MAX_SKB_FRAGS) {
 				if (copy > PAGE_SIZE)
@@ -1461,13 +1461,15 @@ alloc_new_skb:
 				sk->sk_sndmsg_page = page;
 				sk->sk_sndmsg_off = 0;
 
+				/* XXX no ref ? */
 				skb_fill_page_desc(skb, i, page, 0, 0);
 				frag = &skb_shinfo(skb)->frags[i];
 			} else {
 				err = -EMSGSIZE;
 				goto error;
 			}
-			if (getfrag(from, page_address(frag->page)+frag->page_offset+frag->size, offset, copy, skb->len, skb) < 0) {
+			if (getfrag(from, skb_frag_address(frag)+frag->size,
+				    offset, copy, skb->len, skb) < 0) {
 				err = -EFAULT;
 				goto error;
 			}
diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index fc91ad7..f781b9a 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -70,26 +70,29 @@ static int ipcomp_decompress(struct xfrm_state *x, struct sk_buff *skb)
 
 	while ((scratch += len, dlen -= len) > 0) {
 		skb_frag_t *frag;
+		struct page *page;
 
 		err = -EMSGSIZE;
 		if (WARN_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS))
 			goto out;
 
 		frag = skb_shinfo(skb)->frags + skb_shinfo(skb)->nr_frags;
-		frag->page = alloc_page(GFP_ATOMIC);
+		page = alloc_page(GFP_ATOMIC);
 
 		err = -ENOMEM;
-		if (!frag->page)
+		if (!page)
 			goto out;
 
+		__skb_frag_set_page(frag, page);
+
 		len = PAGE_SIZE;
 		if (dlen < len)
 			len = dlen;
 
-		memcpy(page_address(frag->page), scratch, len);
-
 		frag->page_offset = 0;
 		frag->size = len;
+		memcpy(skb_frag_address(frag), scratch, len);
+
 		skb->truesize += len;
 		skb->data_len += len;
 		skb->len += len;
-- 
1.7.2.5



* [PATCH 06/10] net: convert drivers to paged frag API.
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (4 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 05/10] net: convert protocols to SKB " Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 07/10] net: add support for per-paged-fragment destructors Ian Campbell
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

Coccinelle was quite useful in the initial stages of this conversion, but a) my
spatch was ugly as sin and b) I've done several rounds of updates since then,
so the spatches no longer actually represent the resulting changes anyway.

NB: should be split into individual patches to be acked by relevant driver
maintainers.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/atm/eni.c                       |    4 +-
 drivers/infiniband/hw/amso1100/c2.c     |    7 +----
 drivers/infiniband/hw/nes/nes_nic.c     |   10 +++-----
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    6 +++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    6 +++-
 drivers/net/3c59x.c                     |    4 +-
 drivers/net/8139cp.c                    |    3 +-
 drivers/net/acenic.c                    |    6 ++--
 drivers/net/atl1c/atl1c_main.c          |    3 +-
 drivers/net/atl1e/atl1e_main.c          |    9 +++----
 drivers/net/atlx/atl1.c                 |    7 ++---
 drivers/net/benet/be_main.c             |   10 ++++----
 drivers/net/bna/bnad.c                  |    4 +-
 drivers/net/bnx2.c                      |    8 +++---
 drivers/net/bnx2x/bnx2x_cmn.c           |    3 +-
 drivers/net/cassini.c                   |   15 ++++++-------
 drivers/net/chelsio/sge.c               |    5 +--
 drivers/net/cxgb3/sge.c                 |    6 ++--
 drivers/net/cxgb4/sge.c                 |   14 +++++++-----
 drivers/net/cxgb4vf/sge.c               |   13 ++++++-----
 drivers/net/e1000/e1000_main.c          |    9 +++----
 drivers/net/e1000e/netdev.c             |    7 ++---
 drivers/net/enic/enic_main.c            |   12 ++++------
 drivers/net/forcedeth.c                 |   10 +++++---
 drivers/net/gianfar.c                   |   10 ++++----
 drivers/net/greth.c                     |    9 ++-----
 drivers/net/ibmveth.c                   |    5 +--
 drivers/net/igb/igb_main.c              |    5 +---
 drivers/net/igbvf/netdev.c              |    5 +---
 drivers/net/ixgb/ixgb_main.c            |    6 ++--
 drivers/net/ixgbe/ixgbe_main.c          |    9 +++----
 drivers/net/ixgbevf/ixgbevf_main.c      |   10 +++-----
 drivers/net/jme.c                       |    5 ++-
 drivers/net/ksz884x.c                   |    3 +-
 drivers/net/mlx4/en_rx.c                |   22 +++++++++-----------
 drivers/net/mlx4/en_tx.c                |   17 +-------------
 drivers/net/mv643xx_eth.c               |    8 +++---
 drivers/net/myri10ge/myri10ge.c         |   12 +++++-----
 drivers/net/netxen/netxen_nic_main.c    |    2 +-
 drivers/net/niu.c                       |    4 +-
 drivers/net/ns83820.c                   |    5 +--
 drivers/net/pasemi_mac.c                |    5 +--
 drivers/net/qla3xxx.c                   |    5 +--
 drivers/net/qlcnic/qlcnic_main.c        |    2 +-
 drivers/net/qlge/qlge_main.c            |    8 ++----
 drivers/net/r8169.c                     |    2 +-
 drivers/net/s2io.c                      |    7 ++---
 drivers/net/sfc/rx.c                    |    2 +-
 drivers/net/sfc/tx.c                    |    9 +------
 drivers/net/skge.c                      |    4 +-
 drivers/net/sky2.c                      |   13 +++++------
 drivers/net/starfire.c                  |    2 +-
 drivers/net/stmmac/stmmac_main.c        |    5 +--
 drivers/net/sungem.c                    |    7 ++---
 drivers/net/sunhme.c                    |    5 +--
 drivers/net/tehuti.c                    |    4 +-
 drivers/net/tg3.c                       |    6 +---
 drivers/net/tsi108_eth.c                |    7 +++--
 drivers/net/typhoon.c                   |    3 +-
 drivers/net/via-velocity.c              |    6 ++--
 drivers/net/virtio_net.c                |    2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c       |    7 ++---
 drivers/net/vxge/vxge-main.c            |    6 ++--
 drivers/net/xen-netback/netback.c       |   34 +++++++++++++++++++++----------
 drivers/net/xen-netfront.c              |   28 +++++++++++++++----------
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c       |    2 +-
 drivers/scsi/cxgbi/libcxgbi.c           |    6 ++--
 drivers/scsi/fcoe/fcoe.c                |    2 +-
 drivers/scsi/fcoe/fcoe_transport.c      |    5 ++-
 drivers/staging/et131x/et1310_tx.c      |   11 ++++-----
 drivers/staging/hv/netvsc_drv.c         |    2 +-
 71 files changed, 245 insertions(+), 280 deletions(-)

diff --git a/drivers/atm/eni.c b/drivers/atm/eni.c
index 3230ea0..60bc9c5 100644
--- a/drivers/atm/eni.c
+++ b/drivers/atm/eni.c
@@ -1133,8 +1133,8 @@ DPRINTK("doing direct send\n"); /* @@@ well, this doesn't work anyway */
 				    skb->data,
 				    skb_headlen(skb));
 			else
-				put_dma(tx->index,eni_dev->dma,&j,(unsigned long)
-				    skb_shinfo(skb)->frags[i].page + skb_shinfo(skb)->frags[i].page_offset,
+				put_dma(tx->index,eni_dev->dma,&j,
+				    (unsigned long)skb_frag_address(&skb_shinfo(skb)->frags[i]),
 				    skb_shinfo(skb)->frags[i].size);
 	}
 	if (skb->len & 3)
diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c
index 0cfc455..0c411b4 100644
--- a/drivers/infiniband/hw/amso1100/c2.c
+++ b/drivers/infiniband/hw/amso1100/c2.c
@@ -801,11 +801,8 @@ static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 			maplen = frag->size;
-			mapaddr =
-			    pci_map_page(c2dev->pcidev, frag->page,
-					 frag->page_offset, maplen,
-					 PCI_DMA_TODEVICE);
-
+			mapaddr = skb_frag_pci_map(c2dev->pcidev, frag, 0,
+						   maplen, PCI_DMA_TODEVICE);
 			elem = elem->next;
 			elem->skb = NULL;
 			elem->mapaddr = mapaddr;
diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index d3a1c41..171a1ff 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -441,9 +441,8 @@ static int nes_nic_send(struct sk_buff *skb, struct net_device *netdev)
 		nesnic->tx_skb[nesnic->sq_head] = skb;
 		for (skb_fragment_index = 0; skb_fragment_index < skb_shinfo(skb)->nr_frags;
 				skb_fragment_index++) {
-			bus_address = pci_map_page( nesdev->pcidev,
-					skb_shinfo(skb)->frags[skb_fragment_index].page,
-					skb_shinfo(skb)->frags[skb_fragment_index].page_offset,
+			bus_address = skb_frag_pci_map(nesdev->pcidev,
+					&skb_shinfo(skb)->frags[skb_fragment_index], 0,
 					skb_shinfo(skb)->frags[skb_fragment_index].size,
 					PCI_DMA_TODEVICE);
 			wqe_fragment_length[wqe_fragment_index] =
@@ -561,9 +560,8 @@ tso_sq_no_longer_full:
 			/* Map all the buffers */
 			for (tso_frag_count=0; tso_frag_count < skb_shinfo(skb)->nr_frags;
 					tso_frag_count++) {
-				tso_bus_address[tso_frag_count] = pci_map_page( nesdev->pcidev,
-						skb_shinfo(skb)->frags[tso_frag_count].page,
-						skb_shinfo(skb)->frags[tso_frag_count].page_offset,
+				tso_bus_address[tso_frag_count] = skb_frag_pci_map(nesdev->pcidev,
+						&skb_shinfo(skb)->frags[tso_frag_count], 0,
 						skb_shinfo(skb)->frags[tso_frag_count].size,
 						PCI_DMA_TODEVICE);
 			}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 39913a0..1f20e40 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -169,7 +169,8 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
 			goto partial_error;
 		skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE);
 
-		mapping[i + 1] = ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[i].page,
+		mapping[i + 1] = ib_dma_map_page(priv->ca,
+						 __skb_frag_page(&skb_shinfo(skb)->frags[i]),
 						 0, PAGE_SIZE, DMA_FROM_DEVICE);
 		if (unlikely(ib_dma_mapping_error(priv->ca, mapping[i + 1])))
 			goto partial_error;
@@ -537,7 +538,8 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 
 		if (length == 0) {
 			/* don't need this page */
-			skb_fill_page_desc(toskb, i, frag->page, 0, PAGE_SIZE);
+			skb_fill_page_desc(toskb, i, __skb_frag_page(frag),
+					   0, PAGE_SIZE);/* XXX */
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 81ae61d..f6ef6c2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -182,7 +182,8 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 			goto partial_error;
 		skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);
 		mapping[1] =
-			ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[0].page,
+			ib_dma_map_page(priv->ca,
+					__skb_frag_page(&skb_shinfo(skb)->frags[0]),
 					0, PAGE_SIZE, DMA_FROM_DEVICE);
 		if (unlikely(ib_dma_mapping_error(priv->ca, mapping[1])))
 			goto partial_error;
@@ -323,7 +324,8 @@ static int ipoib_dma_map_tx(struct ib_device *ca,
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; ++i) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-		mapping[i + off] = ib_dma_map_page(ca, frag->page,
+		mapping[i + off] = ib_dma_map_page(ca,
+						 __skb_frag_page(frag),
 						 frag->page_offset, frag->size,
 						 DMA_TO_DEVICE);
 		if (unlikely(ib_dma_mapping_error(ca, mapping[i + off])))
diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c
index 8cc2256..fcd3820 100644
--- a/drivers/net/3c59x.c
+++ b/drivers/net/3c59x.c
@@ -2180,8 +2180,8 @@ boomerang_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 			vp->tx_ring[entry].frag[i+1].addr =
 					cpu_to_le32(pci_map_single(VORTEX_PCI(vp),
-											   (void*)page_address(frag->page) + frag->page_offset,
-											   frag->size, PCI_DMA_TODEVICE));
+								   (void*)skb_frag_address(frag),
+								   frag->size, PCI_DMA_TODEVICE));
 
 			if (i == skb_shinfo(skb)->nr_frags-1)
 					vp->tx_ring[entry].frag[i+1].length = cpu_to_le32(frag->size|LAST_FRAG);
diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
index 10c4505..c418aa3 100644
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -815,8 +815,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 
 			len = this_frag->size;
 			mapping = dma_map_single(&cp->pdev->dev,
-						 ((void *) page_address(this_frag->page) +
-						  this_frag->page_offset),
+						 skb_frag_address(this_frag),
 						 len, PCI_DMA_TODEVICE);
 			eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0;
 
diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c
index d7c1bfe4..f848691 100644
--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -2528,9 +2528,9 @@ restart:
 			info = ap->skb->tx_skbuff + idx;
 			desc = ap->tx_ring + idx;
 
-			mapping = pci_map_page(ap->pdev, frag->page,
-					       frag->page_offset, frag->size,
-					       PCI_DMA_TODEVICE);
+			mapping = skb_frag_pci_map(ap->pdev, frag, 0,
+						   frag->size,
+						   PCI_DMA_TODEVICE);
 
 			flagsize = (frag->size << 16);
 			if (skb->ip_summed == CHECKSUM_PARTIAL)
diff --git a/drivers/net/atl1c/atl1c_main.c b/drivers/net/atl1c/atl1c_main.c
index 1269ba5..162a38a 100644
--- a/drivers/net/atl1c/atl1c_main.c
+++ b/drivers/net/atl1c/atl1c_main.c
@@ -2162,8 +2162,7 @@ static void atl1c_tx_map(struct atl1c_adapter *adapter,
 		buffer_info = atl1c_get_tx_buffer(adapter, use_tpd);
 		buffer_info->length = frag->size;
 		buffer_info->dma =
-			pci_map_page(adapter->pdev, frag->page,
-					frag->page_offset,
+			skb_frag_pci_map(adapter->pdev, frag, 0,
 					buffer_info->length,
 					PCI_DMA_TODEVICE);
 		ATL1C_SET_BUFFER_STATE(buffer_info, ATL1C_BUFFER_BUSY);
diff --git a/drivers/net/atl1e/atl1e_main.c b/drivers/net/atl1e/atl1e_main.c
index 86a9122..f48fa37 100644
--- a/drivers/net/atl1e/atl1e_main.c
+++ b/drivers/net/atl1e/atl1e_main.c
@@ -1746,11 +1746,10 @@ static void atl1e_tx_map(struct atl1e_adapter *adapter,
 			buf_len -= tx_buffer->length;
 
 			tx_buffer->dma =
-				pci_map_page(adapter->pdev, frag->page,
-						frag->page_offset +
-						(i * MAX_TX_BUF_LEN),
-						tx_buffer->length,
-						PCI_DMA_TODEVICE);
+				skb_frag_pci_map(adapter->pdev, frag,
+						 (i * MAX_TX_BUF_LEN),
+						 tx_buffer->length,
+						 PCI_DMA_TODEVICE);
 			ATL1E_SET_PCIMAP_TYPE(tx_buffer, ATL1E_TX_PCIMAP_PAGE);
 			use_tpd->buffer_addr = cpu_to_le64(tx_buffer->dma);
 			use_tpd->word2 = (use_tpd->word2 & (~TPD_BUFLEN_MASK)) |
diff --git a/drivers/net/atlx/atl1.c b/drivers/net/atlx/atl1.c
index cd5789f..96caf1a 100644
--- a/drivers/net/atlx/atl1.c
+++ b/drivers/net/atlx/atl1.c
@@ -2283,11 +2283,10 @@ static void atl1_tx_map(struct atl1_adapter *adapter, struct sk_buff *skb,
 			buffer_info->length = (buf_len > ATL1_MAX_TX_BUF_LEN) ?
 				ATL1_MAX_TX_BUF_LEN : buf_len;
 			buf_len -= buffer_info->length;
-			buffer_info->dma = pci_map_page(adapter->pdev,
-				frag->page,
-				frag->page_offset + (i * ATL1_MAX_TX_BUF_LEN),
+			buffer_info->dma = skb_frag_pci_map(
+				adapter->pdev, frag,
+				(i * ATL1_MAX_TX_BUF_LEN),
 				buffer_info->length, PCI_DMA_TODEVICE);
-
 			if (++next_to_use == tpd_ring->count)
 				next_to_use = 0;
 		}
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index a485f7f..3e4f643 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -715,8 +715,8 @@ static int make_tx_wrbs(struct be_adapter *adapter,
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		struct skb_frag_struct *frag =
 			&skb_shinfo(skb)->frags[i];
-		busaddr = dma_map_page(dev, frag->page, frag->page_offset,
-				       frag->size, DMA_TO_DEVICE);
+		busaddr = skb_frag_dma_map(dev, frag, 0,
+					   frag->size, DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, busaddr))
 			goto dma_err;
 		wrb = queue_head_node(txq);
@@ -1122,7 +1122,7 @@ static void skb_fill_rx_data(struct be_adapter *adapter, struct be_rx_obj *rxo,
 		skb->tail += curr_frag_len;
 	} else {
 		skb_shinfo(skb)->nr_frags = 1;
-		skb_shinfo(skb)->frags[0].page = page_info->page;
+		skb_frag_set_page(skb, 0, page_info->page);
 		skb_shinfo(skb)->frags[0].page_offset =
 					page_info->page_offset + hdr_len;
 		skb_shinfo(skb)->frags[0].size = curr_frag_len - hdr_len;
@@ -1147,7 +1147,7 @@ static void skb_fill_rx_data(struct be_adapter *adapter, struct be_rx_obj *rxo,
 		if (page_info->page_offset == 0) {
 			/* Fresh page */
 			j++;
-			skb_shinfo(skb)->frags[j].page = page_info->page;
+			skb_frag_set_page(skb, j, page_info->page);
 			skb_shinfo(skb)->frags[j].page_offset =
 							page_info->page_offset;
 			skb_shinfo(skb)->frags[j].size = 0;
@@ -1236,7 +1236,7 @@ static void be_rx_compl_process_gro(struct be_adapter *adapter,
 		if (i == 0 || page_info->page_offset == 0) {
 			/* First frag or Fresh page */
 			j++;
-			skb_shinfo(skb)->frags[j].page = page_info->page;
+			skb_frag_set_page(skb, j, page_info->page);
 			skb_shinfo(skb)->frags[j].page_offset =
 							page_info->page_offset;
 			skb_shinfo(skb)->frags[j].size = 0;
diff --git a/drivers/net/bna/bnad.c b/drivers/net/bna/bnad.c
index 7d25a97..a79f69b 100644
--- a/drivers/net/bna/bnad.c
+++ b/drivers/net/bna/bnad.c
@@ -2636,8 +2636,8 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 
 		BUG_ON(!(size <= BFI_TX_MAX_DATA_PER_VECTOR));
 		txqent->vector[vect_id].length = htons(size);
-		dma_addr = dma_map_page(&bnad->pcidev->dev, frag->page,
-					frag->page_offset, size, DMA_TO_DEVICE);
+		dma_addr = skb_frag_dma_map(&bnad->pcidev->dev, frag,
+					    0, size, DMA_TO_DEVICE);
 		dma_unmap_addr_set(&unmap_q->unmap_array[unmap_prod], dma_addr,
 				   dma_addr);
 		BNA_SET_DMA_ADDR(dma_addr, &txqent->vector[vect_id].host_addr);
diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 57d3293..ff90b13 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2880,8 +2880,8 @@ bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr,
 
 		shinfo = skb_shinfo(skb);
 		shinfo->nr_frags--;
-		page = shinfo->frags[shinfo->nr_frags].page;
-		shinfo->frags[shinfo->nr_frags].page = NULL;
+		page = __skb_frag_page(&shinfo->frags[shinfo->nr_frags]);
+		__skb_frag_set_page(&shinfo->frags[shinfo->nr_frags], NULL);
 
 		cons_rx_pg->page = page;
 		dev_kfree_skb(skb);
@@ -6461,8 +6461,8 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		txbd = &txr->tx_desc_ring[ring_prod];
 
 		len = frag->size;
-		mapping = dma_map_page(&bp->pdev->dev, frag->page, frag->page_offset,
-				       len, PCI_DMA_TODEVICE);
+		mapping = skb_frag_dma_map(&bp->pdev->dev, frag, 0, len,
+					   PCI_DMA_TODEVICE);
 		if (dma_mapping_error(&bp->pdev->dev, mapping))
 			goto dma_error;
 		dma_unmap_addr_set(&txr->tx_buf_ring[ring_prod], mapping,
diff --git a/drivers/net/bnx2x/bnx2x_cmn.c b/drivers/net/bnx2x/bnx2x_cmn.c
index 28904433..dee09d7 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/bnx2x/bnx2x_cmn.c
@@ -2406,8 +2406,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (total_pkt_bd == NULL)
 			total_pkt_bd = &fp->tx_desc_ring[bd_prod].reg_bd;
 
-		mapping = dma_map_page(&bp->pdev->dev, frag->page,
-				       frag->page_offset,
+		mapping = skb_frag_dma_map(&bp->pdev->dev, frag, 0,
 				       frag->size, DMA_TO_DEVICE);
 
 		tx_data_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
diff --git a/drivers/net/cassini.c b/drivers/net/cassini.c
index 22ce03e..5261673 100644
--- a/drivers/net/cassini.c
+++ b/drivers/net/cassini.c
@@ -2047,8 +2047,8 @@ static int cas_rx_process_pkt(struct cas *cp, struct cas_rx_comp *rxc,
 		skb->truesize += hlen - swivel;
 		skb->len      += hlen - swivel;
 
-		get_page(page->buffer);
-		frag->page = page->buffer;
+		__skb_frag_set_page(frag, page->buffer);
+		__skb_frag_ref(frag);
 		frag->page_offset = off;
 		frag->size = hlen - swivel;
 
@@ -2071,8 +2071,8 @@ static int cas_rx_process_pkt(struct cas *cp, struct cas_rx_comp *rxc,
 			skb->len      += hlen;
 			frag++;
 
-			get_page(page->buffer);
-			frag->page = page->buffer;
+			__skb_frag_set_page(frag, page->buffer);
+			__skb_frag_ref(frag);
 			frag->page_offset = 0;
 			frag->size = hlen;
 			RX_USED_ADD(page, hlen + cp->crc_size);
@@ -2829,9 +2829,8 @@ static inline int cas_xmit_tx_ringN(struct cas *cp, int ring,
 		skb_frag_t *fragp = &skb_shinfo(skb)->frags[frag];
 
 		len = fragp->size;
-		mapping = pci_map_page(cp->pdev, fragp->page,
-				       fragp->page_offset, len,
-				       PCI_DMA_TODEVICE);
+		mapping = skb_frag_pci_map(cp->pdev, fragp, 0, len,
+					   PCI_DMA_TODEVICE);
 
 		tabort = cas_calc_tabort(cp, fragp->page_offset, len);
 		if (unlikely(tabort)) {
@@ -2842,7 +2841,7 @@ static inline int cas_xmit_tx_ringN(struct cas *cp, int ring,
 				      ctrl, 0);
 			entry = TX_DESC_NEXT(ring, entry);
 
-			addr = cas_page_map(fragp->page);
+			addr = cas_page_map(__skb_frag_page(fragp));
 			memcpy(tx_tiny_buf(cp, ring, entry),
 			       addr + fragp->page_offset + len - tabort,
 			       tabort);
diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
index 58380d2..7215fe5 100644
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -1276,9 +1276,8 @@ static inline void write_tx_descs(struct adapter *adapter, struct sk_buff *skb,
 			ce = q->centries;
 		}
 
-		mapping = pci_map_page(adapter->pdev, frag->page,
-				       frag->page_offset, frag->size,
-				       PCI_DMA_TODEVICE);
+		mapping = skb_frag_pci_map(adapter->pdev, frag, 0,
+					   frag->size, PCI_DMA_TODEVICE);
 		desc_mapping = mapping;
 		desc_len = frag->size;
 
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index 3f562ba..d8a0d81 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -979,8 +979,8 @@ static inline unsigned int make_sgl(const struct sk_buff *skb,
 	for (i = 0; i < nfrags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		mapping = pci_map_page(pdev, frag->page, frag->page_offset,
-				       frag->size, PCI_DMA_TODEVICE);
+		mapping = skb_frag_pci_map(pdev, frag, 0,
+					   frag->size, PCI_DMA_TODEVICE);
 		sgp->len[j] = cpu_to_be32(frag->size);
 		sgp->addr[j] = cpu_to_be64(mapping);
 		j ^= 1;
@@ -2133,7 +2133,7 @@ static void lro_add_page(struct adapter *adap, struct sge_qset *qs,
 	len -= offset;
 
 	rx_frag += nr_frags;
-	rx_frag->page = sd->pg_chunk.page;
+	__skb_frag_set_page(rx_frag, sd->pg_chunk.page);
 	rx_frag->page_offset = sd->pg_chunk.offset + offset;
 	rx_frag->size = len;
 
diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
index 56adf44..f1813b5 100644
--- a/drivers/net/cxgb4/sge.c
+++ b/drivers/net/cxgb4/sge.c
@@ -215,8 +215,8 @@ static int map_skb(struct device *dev, const struct sk_buff *skb,
 	end = &si->frags[si->nr_frags];
 
 	for (fp = si->frags; fp < end; fp++) {
-		*++addr = dma_map_page(dev, fp->page, fp->page_offset, fp->size,
-				       DMA_TO_DEVICE);
+		*++addr = skb_frag_dma_map(dev, fp, 0, fp->size,
+					   DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, *addr))
 			goto unwind;
 	}
@@ -1409,13 +1409,14 @@ int cxgb4_ofld_send(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(cxgb4_ofld_send);
 
-static inline void copy_frags(struct skb_shared_info *ssi,
+static inline void copy_frags(struct sk_buff *skb,
 			      const struct pkt_gl *gl, unsigned int offset)
 {
+	struct skb_shared_info *ssi = skb_shinfo(skb);
 	unsigned int n;
 
 	/* usually there's just one frag */
-	ssi->frags[0].page = gl->frags[0].page;
+	skb_frag_set_page(skb, 0, gl->frags[0].page);
 	ssi->frags[0].page_offset = gl->frags[0].page_offset + offset;
 	ssi->frags[0].size = gl->frags[0].size - offset;
 	ssi->nr_frags = gl->nfrags;
@@ -1459,7 +1460,7 @@ struct sk_buff *cxgb4_pktgl_to_skb(const struct pkt_gl *gl,
 		__skb_put(skb, pull_len);
 		skb_copy_to_linear_data(skb, gl->va, pull_len);
 
-		copy_frags(skb_shinfo(skb), gl, pull_len);
+		copy_frags(skb, gl, pull_len);
 		skb->len = gl->tot_len;
 		skb->data_len = skb->len - pull_len;
 		skb->truesize += skb->data_len;
@@ -1522,7 +1523,7 @@ static void do_gro(struct sge_eth_rxq *rxq, const struct pkt_gl *gl,
 		return;
 	}
 
-	copy_frags(skb_shinfo(skb), gl, RX_PKT_PAD);
+	copy_frags(skb, gl, RX_PKT_PAD);
 	skb->len = gl->tot_len - RX_PKT_PAD;
 	skb->data_len = skb->len;
 	skb->truesize += skb->data_len;
@@ -1735,6 +1736,7 @@ static int process_responses(struct sge_rspq *q, int budget)
 
 			si.va = page_address(si.frags[0].page) +
 				si.frags[0].page_offset;
+
 			prefetch(si.va);
 
 			si.nfrags = frags + 1;
diff --git a/drivers/net/cxgb4vf/sge.c b/drivers/net/cxgb4vf/sge.c
index 5fd75fd..f4c4480 100644
--- a/drivers/net/cxgb4vf/sge.c
+++ b/drivers/net/cxgb4vf/sge.c
@@ -296,8 +296,8 @@ static int map_skb(struct device *dev, const struct sk_buff *skb,
 	si = skb_shinfo(skb);
 	end = &si->frags[si->nr_frags];
 	for (fp = si->frags; fp < end; fp++) {
-		*++addr = dma_map_page(dev, fp->page, fp->page_offset, fp->size,
-				       DMA_TO_DEVICE);
+		*++addr = skb_frag_dma_map(dev, fp, 0, fp->size,
+					   DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, *addr))
 			goto unwind;
 	}
@@ -1397,7 +1397,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
 		skb_copy_to_linear_data(skb, gl->va, pull_len);
 
 		ssi = skb_shinfo(skb);
-		ssi->frags[0].page = gl->frags[0].page;
+		skb_frag_set_page(skb, 0, gl->frags[0].page);
 		ssi->frags[0].page_offset = gl->frags[0].page_offset + pull_len;
 		ssi->frags[0].size = gl->frags[0].size - pull_len;
 		if (gl->nfrags > 1)
@@ -1442,14 +1442,15 @@ void t4vf_pktgl_free(const struct pkt_gl *gl)
  *	Copy an internal packet gather list into a Linux skb_shared_info
  *	structure.
  */
-static inline void copy_frags(struct skb_shared_info *si,
+static inline void copy_frags(struct sk_buff *skb,
 			      const struct pkt_gl *gl,
 			      unsigned int offset)
 {
+	struct skb_shared_info *si = skb_shinfo(skb);
 	unsigned int n;
 
 	/* usually there's just one frag */
-	si->frags[0].page = gl->frags[0].page;
+	skb_frag_set_page(skb, 0, gl->frags[0].page);
 	si->frags[0].page_offset = gl->frags[0].page_offset + offset;
 	si->frags[0].size = gl->frags[0].size - offset;
 	si->nr_frags = gl->nfrags;
@@ -1484,7 +1485,7 @@ static void do_gro(struct sge_eth_rxq *rxq, const struct pkt_gl *gl,
 		return;
 	}
 
-	copy_frags(skb_shinfo(skb), gl, PKTSHIFT);
+	copy_frags(skb, gl, PKTSHIFT);
 	skb->len = gl->tot_len - PKTSHIFT;
 	skb->data_len = skb->len;
 	skb->truesize += skb->data_len;
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 76e8af0..e902cd0 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2861,7 +2861,7 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 
 		frag = &skb_shinfo(skb)->frags[f];
 		len = frag->size;
-		offset = frag->page_offset;
+		offset = 0;
 
 		while (len) {
 			i++;
@@ -2878,7 +2878,7 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 			 * Avoid terminating buffers within evenly-aligned
 			 * dwords. */
 			if (unlikely(adapter->pcix_82544 &&
-			    !((unsigned long)(page_to_phys(frag->page) + offset
+			    !((unsigned long)(page_to_phys(__skb_frag_page(frag)) + offset
 			                      + size - 1) & 4) &&
 			    size > 4))
 				size -= 4;
@@ -2886,9 +2886,8 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 			buffer_info->length = size;
 			buffer_info->time_stamp = jiffies;
 			buffer_info->mapped_as_page = true;
-			buffer_info->dma = dma_map_page(&pdev->dev, frag->page,
-							offset,	size,
-							DMA_TO_DEVICE);
+			buffer_info->dma = skb_frag_dma_map(&pdev->dev, frag,
+						offset, size, DMA_TO_DEVICE);
 			if (dma_mapping_error(&pdev->dev, buffer_info->dma))
 				goto dma_error;
 			buffer_info->next_to_watch = i;
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 3310c3d..30f8a5c 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4599,7 +4599,7 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 
 		frag = &skb_shinfo(skb)->frags[f];
 		len = frag->size;
-		offset = frag->page_offset;
+		offset = 0;
 
 		while (len) {
 			i++;
@@ -4612,9 +4612,8 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 			buffer_info->length = size;
 			buffer_info->time_stamp = jiffies;
 			buffer_info->next_to_watch = i;
-			buffer_info->dma = dma_map_page(&pdev->dev, frag->page,
-							offset, size,
-							DMA_TO_DEVICE);
+			buffer_info->dma = skb_frag_dma_map(&pdev->dev, frag,
+						offset, size, DMA_TO_DEVICE);
 			buffer_info->mapped_as_page = true;
 			if (dma_mapping_error(&pdev->dev, buffer_info->dma))
 				goto dma_error;
diff --git a/drivers/net/enic/enic_main.c b/drivers/net/enic/enic_main.c
index 2f433fb..bbab7cd 100644
--- a/drivers/net/enic/enic_main.c
+++ b/drivers/net/enic/enic_main.c
@@ -584,9 +584,8 @@ static inline void enic_queue_wq_skb_cont(struct enic *enic,
 	for (frag = skb_shinfo(skb)->frags; len_left; frag++) {
 		len_left -= frag->size;
 		enic_queue_wq_desc_cont(wq, skb,
-			pci_map_page(enic->pdev, frag->page,
-				frag->page_offset, frag->size,
-				PCI_DMA_TODEVICE),
+			skb_frag_pci_map(enic->pdev, frag, 0, frag->size,
+					 PCI_DMA_TODEVICE),
 			frag->size,
 			(len_left == 0),	/* EOP? */
 			loopback);
@@ -698,14 +697,13 @@ static inline void enic_queue_wq_skb_tso(struct enic *enic,
 	for (frag = skb_shinfo(skb)->frags; len_left; frag++) {
 		len_left -= frag->size;
 		frag_len_left = frag->size;
-		offset = frag->page_offset;
+		offset = 0;
 
 		while (frag_len_left) {
 			len = min(frag_len_left,
 				(unsigned int)WQ_ENET_MAX_DESC_LEN);
-			dma_addr = pci_map_page(enic->pdev, frag->page,
-				offset, len,
-				PCI_DMA_TODEVICE);
+			dma_addr = skb_frag_pci_map(enic->pdev, frag,
+				offset, len, PCI_DMA_TODEVICE);
 			enic_queue_wq_desc_cont(wq, skb,
 				dma_addr,
 				len,
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 537b695..062eccd 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -2149,8 +2149,9 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			prev_tx = put_tx;
 			prev_tx_ctx = np->put_tx_ctx;
 			bcnt = (size > NV_TX2_TSO_MAX_SIZE) ? NV_TX2_TSO_MAX_SIZE : size;
-			np->put_tx_ctx->dma = pci_map_page(np->pci_dev, frag->page, frag->page_offset+offset, bcnt,
-							   PCI_DMA_TODEVICE);
+			np->put_tx_ctx->dma =
+				skb_frag_pci_map(np->pci_dev, frag, offset, bcnt,
+						 PCI_DMA_TODEVICE);
 			np->put_tx_ctx->dma_len = bcnt;
 			np->put_tx_ctx->dma_single = 0;
 			put_tx->buf = cpu_to_le32(np->put_tx_ctx->dma);
@@ -2260,8 +2261,9 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,
 			prev_tx = put_tx;
 			prev_tx_ctx = np->put_tx_ctx;
 			bcnt = (size > NV_TX2_TSO_MAX_SIZE) ? NV_TX2_TSO_MAX_SIZE : size;
-			np->put_tx_ctx->dma = pci_map_page(np->pci_dev, frag->page, frag->page_offset+offset, bcnt,
-							   PCI_DMA_TODEVICE);
+			np->put_tx_ctx->dma =
+				skb_frag_pci_map(np->pci_dev, frag, offset, bcnt,
+						 PCI_DMA_TODEVICE);
 			np->put_tx_ctx->dma_len = bcnt;
 			np->put_tx_ctx->dma_single = 0;
 			put_tx->bufhigh = cpu_to_le32(dma_high(np->put_tx_ctx->dma));
diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 2dfcc80..766e037 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -2141,11 +2141,11 @@ static int gfar_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			if (i == nr_frags - 1)
 				lstatus |= BD_LFLAG(TXBD_LAST | TXBD_INTERRUPT);
 
-			bufaddr = dma_map_page(&priv->ofdev->dev,
-					skb_shinfo(skb)->frags[i].page,
-					skb_shinfo(skb)->frags[i].page_offset,
-					length,
-					DMA_TO_DEVICE);
+			bufaddr = skb_frag_dma_map(&priv->ofdev->dev,
+						   &skb_shinfo(skb)->frags[i],
+						   0,
+						   length,
+						   DMA_TO_DEVICE);
 
 			/* set the TxBD length and buffer pointer */
 			txbdp->bufPtr = bufaddr;
diff --git a/drivers/net/greth.c b/drivers/net/greth.c
index f181304..37c1eff 100644
--- a/drivers/net/greth.c
+++ b/drivers/net/greth.c
@@ -111,7 +111,7 @@ static void greth_print_tx_packet(struct sk_buff *skb)
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 
 		print_hex_dump(KERN_DEBUG, "TX: ", DUMP_PREFIX_OFFSET, 16, 1,
-			       phys_to_virt(page_to_phys(skb_shinfo(skb)->frags[i].page)) +
+			       phys_to_virt(page_to_phys(skb_frag_page(&skb_shinfo(skb)->frags[i]))) +
 			       skb_shinfo(skb)->frags[i].page_offset,
 			       length, true);
 	}
@@ -526,11 +526,8 @@ greth_start_xmit_gbit(struct sk_buff *skb, struct net_device *dev)
 
 		greth_write_bd(&bdp->stat, status);
 
-		dma_addr = dma_map_page(greth->dev,
-					frag->page,
-					frag->page_offset,
-					frag->size,
-					DMA_TO_DEVICE);
+		dma_addr = skb_frag_dma_map(greth->dev, frag, 0, frag->size,
+					    DMA_TO_DEVICE);
 
 		if (unlikely(dma_mapping_error(greth->dev, dma_addr)))
 			goto frag_map_error;
diff --git a/drivers/net/ibmveth.c b/drivers/net/ibmveth.c
index b388d78..65e5874 100644
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -1001,9 +1001,8 @@ retry_bounce:
 		unsigned long dma_addr;
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		dma_addr = dma_map_page(&adapter->vdev->dev, frag->page,
-					frag->page_offset, frag->size,
-					DMA_TO_DEVICE);
+		dma_addr = skb_frag_dma_map(&adapter->vdev->dev, frag, 0,
+					    frag->size, DMA_TO_DEVICE);
 
 		if (dma_mapping_error(&adapter->vdev->dev, dma_addr))
 			goto map_failed_frags;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 2c28621..17f94f4 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -4132,10 +4132,7 @@ static inline int igb_tx_map_adv(struct igb_ring *tx_ring, struct sk_buff *skb,
 		buffer_info->time_stamp = jiffies;
 		buffer_info->next_to_watch = i;
 		buffer_info->mapped_as_page = true;
-		buffer_info->dma = dma_map_page(dev,
-						frag->page,
-						frag->page_offset,
-						len,
+		buffer_info->dma = skb_frag_dma_map(dev, frag, 0, len,
 						DMA_TO_DEVICE);
 		if (dma_mapping_error(dev, buffer_info->dma))
 			goto dma_error;
diff --git a/drivers/net/igbvf/netdev.c b/drivers/net/igbvf/netdev.c
index 1c77fb3..3f6655f 100644
--- a/drivers/net/igbvf/netdev.c
+++ b/drivers/net/igbvf/netdev.c
@@ -2074,10 +2074,7 @@ static inline int igbvf_tx_map_adv(struct igbvf_adapter *adapter,
 		buffer_info->time_stamp = jiffies;
 		buffer_info->next_to_watch = i;
 		buffer_info->mapped_as_page = true;
-		buffer_info->dma = dma_map_page(&pdev->dev,
-						frag->page,
-						frag->page_offset,
-						len,
+		buffer_info->dma = skb_frag_dma_map(&pdev->dev, frag, 0, len,
 						DMA_TO_DEVICE);
 		if (dma_mapping_error(&pdev->dev, buffer_info->dma))
 			goto dma_error;
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 6a130eb..45c4e90 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1341,7 +1341,7 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff *skb,
 
 		frag = &skb_shinfo(skb)->frags[f];
 		len = frag->size;
-		offset = frag->page_offset;
+		offset = 0;
 
 		while (len) {
 			i++;
@@ -1361,8 +1361,8 @@ ixgb_tx_map(struct ixgb_adapter *adapter, struct sk_buff *skb,
 			buffer_info->time_stamp = jiffies;
 			buffer_info->mapped_as_page = true;
 			buffer_info->dma =
-				dma_map_page(&pdev->dev, frag->page,
-					     offset, size, DMA_TO_DEVICE);
+				skb_frag_dma_map(&pdev->dev, frag, offset, size,
+						 DMA_TO_DEVICE);
 			if (dma_mapping_error(&pdev->dev, buffer_info->dma))
 				goto dma_error;
 			buffer_info->next_to_watch = 0;
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 08e8e25..307cf06 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -6632,7 +6632,7 @@ static int ixgbe_tx_map(struct ixgbe_adapter *adapter,
 
 		frag = &skb_shinfo(skb)->frags[f];
 		len = min((unsigned int)frag->size, total);
-		offset = frag->page_offset;
+		offset = 0;
 
 		while (len) {
 			i++;
@@ -6643,10 +6643,9 @@ static int ixgbe_tx_map(struct ixgbe_adapter *adapter,
 			size = min(len, (uint)IXGBE_MAX_DATA_PER_TXD);
 
 			tx_buffer_info->length = size;
-			tx_buffer_info->dma = dma_map_page(dev,
-							   frag->page,
-							   offset, size,
-							   DMA_TO_DEVICE);
+			tx_buffer_info->dma =
+				skb_frag_dma_map(dev, frag, offset, size,
+						 DMA_TO_DEVICE);
 			tx_buffer_info->mapped_as_page = true;
 			if (dma_mapping_error(dev, tx_buffer_info->dma))
 				goto dma_error;
diff --git a/drivers/net/ixgbevf/ixgbevf_main.c b/drivers/net/ixgbevf/ixgbevf_main.c
index 28d3cb2..ad05ad9 100644
--- a/drivers/net/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ixgbevf/ixgbevf_main.c
@@ -2951,18 +2951,16 @@ static int ixgbevf_tx_map(struct ixgbevf_adapter *adapter,
 
 		frag = &skb_shinfo(skb)->frags[f];
 		len = min((unsigned int)frag->size, total);
-		offset = frag->page_offset;
+		offset = 0;
 
 		while (len) {
 			tx_buffer_info = &tx_ring->tx_buffer_info[i];
 			size = min(len, (unsigned int)IXGBE_MAX_DATA_PER_TXD);
 
 			tx_buffer_info->length = size;
-			tx_buffer_info->dma = dma_map_page(&adapter->pdev->dev,
-							   frag->page,
-							   offset,
-							   size,
-							   DMA_TO_DEVICE);
+			tx_buffer_info->dma =
+				skb_frag_dma_map(&adapter->pdev->dev, frag,
+						 offset, size, DMA_TO_DEVICE);
 			tx_buffer_info->mapped_as_page = true;
 			if (dma_mapping_error(&pdev->dev, tx_buffer_info->dma))
 				goto dma_error;
diff --git a/drivers/net/jme.c b/drivers/net/jme.c
index b5b174a..a73e895 100644
--- a/drivers/net/jme.c
+++ b/drivers/net/jme.c
@@ -1928,8 +1928,9 @@ jme_map_tx_skb(struct jme_adapter *jme, struct sk_buff *skb, int idx)
 		ctxdesc = txdesc + ((idx + i + 2) & (mask));
 		ctxbi = txbi + ((idx + i + 2) & (mask));
 
-		jme_fill_tx_map(jme->pdev, ctxdesc, ctxbi, frag->page,
-				 frag->page_offset, frag->size, hidma);
+		jme_fill_tx_map(jme->pdev, ctxdesc, ctxbi,
+				__skb_frag_page(frag),
+				frag->page_offset, frag->size, hidma);
 	}
 
 	len = skb_is_nonlinear(skb) ? skb_headlen(skb) : skb->len;
diff --git a/drivers/net/ksz884x.c b/drivers/net/ksz884x.c
index 41ea592..e610d88 100644
--- a/drivers/net/ksz884x.c
+++ b/drivers/net/ksz884x.c
@@ -4703,8 +4703,7 @@ static void send_packet(struct sk_buff *skb, struct net_device *dev)
 
 			dma_buf->dma = pci_map_single(
 				hw_priv->pdev,
-				page_address(this_frag->page) +
-				this_frag->page_offset,
+				skb_frag_address(this_frag),
 				dma_buf->len,
 				PCI_DMA_TODEVICE);
 			set_tx_buf(desc, dma_buf->dma);
diff --git a/drivers/net/mlx4/en_rx.c b/drivers/net/mlx4/en_rx.c
index 277215f..21a89e0 100644
--- a/drivers/net/mlx4/en_rx.c
+++ b/drivers/net/mlx4/en_rx.c
@@ -60,20 +60,18 @@ static int mlx4_en_alloc_frag(struct mlx4_en_priv *priv,
 		if (!page)
 			return -ENOMEM;
 
-		skb_frags[i].page = page_alloc->page;
+		__skb_frag_set_page(&skb_frags[i], page_alloc->page);
 		skb_frags[i].page_offset = page_alloc->offset;
 		page_alloc->page = page;
 		page_alloc->offset = frag_info->frag_align;
 	} else {
-		page = page_alloc->page;
-		get_page(page);
-
-		skb_frags[i].page = page;
+		__skb_frag_set_page(&skb_frags[i], page_alloc->page);
+		__skb_frag_ref(&skb_frags[i]);
 		skb_frags[i].page_offset = page_alloc->offset;
 		page_alloc->offset += frag_info->frag_stride;
 	}
-	dma = pci_map_single(mdev->pdev, page_address(skb_frags[i].page) +
-			     skb_frags[i].page_offset, frag_info->frag_size,
+	dma = pci_map_single(mdev->pdev, skb_frag_address(&skb_frags[i]),
+			     frag_info->frag_size,
 			     PCI_DMA_FROMDEVICE);
 	rx_desc->data[i].addr = cpu_to_be64(dma);
 	return 0;
@@ -169,7 +167,7 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 
 err:
 	while (i--)
-		put_page(skb_frags[i].page);
+		__skb_frag_unref(&skb_frags[i]);
 	return -ENOMEM;
 }
 
@@ -196,7 +194,7 @@ static void mlx4_en_free_rx_desc(struct mlx4_en_priv *priv,
 		en_dbg(DRV, priv, "Unmapping buffer at dma:0x%llx\n", (u64) dma);
 		pci_unmap_single(mdev->pdev, dma, skb_frags[nr].size,
 				 PCI_DMA_FROMDEVICE);
-		put_page(skb_frags[nr].page);
+		__skb_frag_unref(&skb_frags[nr]);
 	}
 }
 
@@ -420,7 +418,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 			break;
 
 		/* Save page reference in skb */
-		skb_frags_rx[nr].page = skb_frags[nr].page;
+		__skb_frag_set_page(&skb_frags_rx[nr], skb_frags[nr].page);
 		skb_frags_rx[nr].size = skb_frags[nr].size;
 		skb_frags_rx[nr].page_offset = skb_frags[nr].page_offset;
 		dma = be64_to_cpu(rx_desc->data[nr].addr);
@@ -444,7 +442,7 @@ fail:
 	 * the descriptor) of this packet; remaining fragments are reused... */
 	while (nr > 0) {
 		nr--;
-		put_page(skb_frags_rx[nr].page);
+		__skb_frag_unref(&skb_frags_rx[nr]);
 	}
 	return 0;
 }
@@ -474,7 +472,7 @@ static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 
 	/* Get pointer to first fragment so we could copy the headers into the
 	 * (linear part of the) skb */
-	va = page_address(skb_frags[0].page) + skb_frags[0].page_offset;
+	va = skb_frag_address(&skb_frags[0]);
 
 	if (length <= SMALL_PACKET_SIZE) {
 		/* We are copying all relevant data to the skb - temporarily
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
index b229acf..10049b1 100644
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -461,26 +461,13 @@ static inline void mlx4_en_xmit_poll(struct mlx4_en_priv *priv, int tx_ind)
 		}
 }
 
-static void *get_frag_ptr(struct sk_buff *skb)
-{
-	struct skb_frag_struct *frag =  &skb_shinfo(skb)->frags[0];
-	struct page *page = frag->page;
-	void *ptr;
-
-	ptr = page_address(page);
-	if (unlikely(!ptr))
-		return NULL;
-
-	return ptr + frag->page_offset;
-}
-
 static int is_inline(struct sk_buff *skb, void **pfrag)
 {
 	void *ptr;
 
 	if (inline_thold && !skb_is_gso(skb) && skb->len <= inline_thold) {
 		if (skb_shinfo(skb)->nr_frags == 1) {
-			ptr = get_frag_ptr(skb);
+			ptr = skb_frag_address_safe(&skb_shinfo(skb)->frags[0]);
 			if (unlikely(!ptr))
 				return 0;
 
@@ -757,7 +744,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		/* Map fragments */
 		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
 			frag = &skb_shinfo(skb)->frags[i];
-			dma = pci_map_page(mdev->dev->pdev, frag->page, frag->page_offset,
+			dma = skb_frag_pci_map(mdev->dev->pdev, frag, 0,
 					   frag->size, PCI_DMA_TODEVICE);
 			data->addr = cpu_to_be64(dma);
 			data->lkey = cpu_to_be32(mdev->mr.key);
diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
index a5d9b1c..d02a034 100644
--- a/drivers/net/mv643xx_eth.c
+++ b/drivers/net/mv643xx_eth.c
@@ -752,10 +752,10 @@ static void txq_submit_frag_skb(struct tx_queue *txq, struct sk_buff *skb)
 
 		desc->l4i_chk = 0;
 		desc->byte_cnt = this_frag->size;
-		desc->buf_ptr = dma_map_page(mp->dev->dev.parent,
-					     this_frag->page,
-					     this_frag->page_offset,
-					     this_frag->size, DMA_TO_DEVICE);
+		desc->buf_ptr = skb_frag_dma_map(mp->dev->dev.parent,
+						 this_frag, 0,
+						 this_frag->size,
+						 DMA_TO_DEVICE);
 	}
 }
 
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index bf84849..a3459eb 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1339,7 +1339,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum,
 	/* Fill skb_frag_struct(s) with data from our receive */
 	for (i = 0, remainder = len; remainder > 0; i++) {
 		myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
-		rx_frags[i].page = rx->info[idx].page;
+		__skb_frag_set_page(&rx_frags[i], rx->info[idx].page); /* XXX */
 		rx_frags[i].page_offset = rx->info[idx].page_offset;
 		if (remainder < MYRI10GE_ALLOC_SIZE)
 			rx_frags[i].size = remainder;
@@ -1372,7 +1372,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum,
 		ss->stats.rx_dropped++;
 		do {
 			i--;
-			put_page(rx_frags[i].page);
+			__skb_frag_unref(&rx_frags[i]); /* XXX */
 		} while (i != 0);
 		return 0;
 	}
@@ -1380,7 +1380,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum,
 	/* Attach the pages to the skb, and trim off any padding */
 	myri10ge_rx_skb_build(skb, va, rx_frags, len, hlen);
 	if (skb_shinfo(skb)->frags[0].size <= 0) {
-		put_page(skb_shinfo(skb)->frags[0].page);
+		skb_frag_unref(skb, 0);
 		skb_shinfo(skb)->nr_frags = 0;
 	}
 	skb->protocol = eth_type_trans(skb, dev);
@@ -2220,7 +2220,7 @@ myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
 	struct ethhdr *eh;
 	struct vlan_ethhdr *veh;
 	struct iphdr *iph;
-	u8 *va = page_address(frag->page) + frag->page_offset;
+	u8 *va = skb_frag_address(frag);
 	unsigned long ll_hlen;
 	/* passed opaque through lro_receive_frags() */
 	__wsum csum = (__force __wsum) (unsigned long)priv;
@@ -2863,8 +2863,8 @@ again:
 		frag = &skb_shinfo(skb)->frags[frag_idx];
 		frag_idx++;
 		len = frag->size;
-		bus = pci_map_page(mgp->pdev, frag->page, frag->page_offset,
-				   len, PCI_DMA_TODEVICE);
+		bus = skb_frag_pci_map(mgp->pdev, frag, 0, len,
+				       PCI_DMA_TODEVICE);
 		dma_unmap_addr_set(&tx->info[idx], bus, bus);
 		dma_unmap_len_set(&tx->info[idx], len, len);
 	}
diff --git a/drivers/net/netxen/netxen_nic_main.c b/drivers/net/netxen/netxen_nic_main.c
index c0788a3..f962aa4 100644
--- a/drivers/net/netxen/netxen_nic_main.c
+++ b/drivers/net/netxen/netxen_nic_main.c
@@ -1836,7 +1836,7 @@ netxen_map_tx_skb(struct pci_dev *pdev,
 		frag = &skb_shinfo(skb)->frags[i];
 		nf = &pbuf->frag_array[i+1];
 
-		map = pci_map_page(pdev, frag->page, frag->page_offset,
+		map = skb_frag_pci_map(pdev, frag, 0,
 				frag->size, PCI_DMA_TODEVICE);
 		if (pci_dma_mapping_error(pdev, map))
 			goto unwind;
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
index cc25bff..a901193 100644
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -3290,7 +3290,7 @@ static void niu_rx_skb_append(struct sk_buff *skb, struct page *page,
 	int i = skb_shinfo(skb)->nr_frags;
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-	frag->page = page;
+	__skb_frag_set_page(frag, page);
 	frag->page_offset = offset;
 	frag->size = size;
 
@@ -6731,7 +6731,7 @@ static netdev_tx_t niu_start_xmit(struct sk_buff *skb,
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 		len = frag->size;
-		mapping = np->ops->map_page(np->device, frag->page,
+		mapping = np->ops->map_page(np->device, __skb_frag_page(frag),
 					    frag->page_offset, len,
 					    DMA_TO_DEVICE);
 
diff --git a/drivers/net/ns83820.c b/drivers/net/ns83820.c
index 3e4040f..39c5e21 100644
--- a/drivers/net/ns83820.c
+++ b/drivers/net/ns83820.c
@@ -1181,9 +1181,8 @@ again:
 		if (!nr_frags)
 			break;
 
-		buf = pci_map_page(dev->pci_dev, frag->page,
-				   frag->page_offset,
-				   frag->size, PCI_DMA_TODEVICE);
+		buf = skb_frag_pci_map(dev->pci_dev, frag, 0, frag->size,
+				       PCI_DMA_TODEVICE);
 		dprintk("frag: buf=%08Lx  page=%08lx offset=%08lx\n",
 			(long long)buf, (long) page_to_pfn(frag->page),
 			frag->page_offset);
diff --git a/drivers/net/pasemi_mac.c b/drivers/net/pasemi_mac.c
index 9ec112c..2d71ed2 100644
--- a/drivers/net/pasemi_mac.c
+++ b/drivers/net/pasemi_mac.c
@@ -1505,9 +1505,8 @@ static int pasemi_mac_start_tx(struct sk_buff *skb, struct net_device *dev)
 	for (i = 0; i < nfrags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		map[i+1] = pci_map_page(mac->dma_pdev, frag->page,
-					frag->page_offset, frag->size,
-					PCI_DMA_TODEVICE);
+		map[i + 1] = skb_frag_pci_map(mac->dma_pdev, frag, 0,
+					      frag->size, PCI_DMA_TODEVICE);
 		map_size[i+1] = frag->size;
 		if (pci_dma_mapping_error(mac->dma_pdev, map[i+1])) {
 			nfrags = i;
diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index 771bb61..79ff9c1 100644
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -2388,9 +2388,8 @@ static int ql_send_map(struct ql3_adapter *qdev,
 			seg++;
 		}
 
-		map = pci_map_page(qdev->pdev, frag->page,
-				   frag->page_offset, frag->size,
-				   PCI_DMA_TODEVICE);
+		map = skb_frag_pci_map(qdev->pdev, frag, 0, frag->size,
+				       PCI_DMA_TODEVICE);
 
 		err = pci_dma_mapping_error(qdev->pdev, map);
 		if (err) {
diff --git a/drivers/net/qlcnic/qlcnic_main.c b/drivers/net/qlcnic/qlcnic_main.c
index 0f6af5c..e8170e1 100644
--- a/drivers/net/qlcnic/qlcnic_main.c
+++ b/drivers/net/qlcnic/qlcnic_main.c
@@ -2120,7 +2120,7 @@ qlcnic_map_tx_skb(struct pci_dev *pdev,
 		frag = &skb_shinfo(skb)->frags[i];
 		nf = &pbuf->frag_array[i+1];
 
-		map = pci_map_page(pdev, frag->page, frag->page_offset,
+		map = skb_frag_pci_map(pdev, frag, 0,
 				frag->size, PCI_DMA_TODEVICE);
 		if (pci_dma_mapping_error(pdev, map))
 			goto unwind;
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index 930ae45..4503e16 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -1430,10 +1430,8 @@ static int ql_map_send(struct ql_adapter *qdev,
 			map_idx++;
 		}
 
-		map =
-		    pci_map_page(qdev->pdev, frag->page,
-				 frag->page_offset, frag->size,
-				 PCI_DMA_TODEVICE);
+		map = skb_frag_pci_map(qdev->pdev, frag, 0, frag->size,
+				       PCI_DMA_TODEVICE);
 
 		err = pci_dma_mapping_error(qdev->pdev, map);
 		if (err) {
@@ -1494,7 +1492,7 @@ static void ql_process_mac_rx_gro_page(struct ql_adapter *qdev,
 	rx_frag = skb_shinfo(skb)->frags;
 	nr_frags = skb_shinfo(skb)->nr_frags;
 	rx_frag += nr_frags;
-	rx_frag->page = lbq_desc->p.pg_chunk.page;
+	__skb_frag_set_page(rx_frag, lbq_desc->p.pg_chunk.page);
 	rx_frag->page_offset = lbq_desc->p.pg_chunk.offset;
 	rx_frag->size = length;
 
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 05d8178..916603d 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -4630,7 +4630,7 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 
 		txd = tp->TxDescArray + entry;
 		len = frag->size;
-		addr = ((void *) page_address(frag->page)) + frag->page_offset;
+		addr = skb_frag_address(frag);
 		mapping = dma_map_single(d, addr, len, DMA_TO_DEVICE);
 		if (unlikely(dma_mapping_error(d, mapping))) {
 			if (net_ratelimit())
diff --git a/drivers/net/s2io.c b/drivers/net/s2io.c
index df0d2c8..1901b26 100644
--- a/drivers/net/s2io.c
+++ b/drivers/net/s2io.c
@@ -4242,10 +4242,9 @@ static netdev_tx_t s2io_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (!frag->size)
 			continue;
 		txdp++;
-		txdp->Buffer_Pointer = (u64)pci_map_page(sp->pdev, frag->page,
-							 frag->page_offset,
-							 frag->size,
-							 PCI_DMA_TODEVICE);
+		txdp->Buffer_Pointer = (u64)skb_frag_pci_map(sp->pdev, frag,
+							     0, frag->size,
+							     PCI_DMA_TODEVICE);
 		txdp->Control_1 = TXD_BUFFER0_SIZE(frag->size);
 		if (offload_type == SKB_GSO_UDP)
 			txdp->Control_1 |= TXD_UFO_EN;
diff --git a/drivers/net/sfc/rx.c b/drivers/net/sfc/rx.c
index 62e4364..91a6b71 100644
--- a/drivers/net/sfc/rx.c
+++ b/drivers/net/sfc/rx.c
@@ -478,7 +478,7 @@ static void efx_rx_packet_gro(struct efx_channel *channel,
 		if (efx->net_dev->features & NETIF_F_RXHASH)
 			skb->rxhash = efx_rx_buf_hash(eh);
 
-		skb_shinfo(skb)->frags[0].page = page;
+		skb_frag_set_page(skb, 0, page);
 		skb_shinfo(skb)->frags[0].page_offset =
 			efx_rx_buf_offset(efx, rx_buf);
 		skb_shinfo(skb)->frags[0].size = rx_buf->len;
diff --git a/drivers/net/sfc/tx.c b/drivers/net/sfc/tx.c
index 84eb99e..237adc4 100644
--- a/drivers/net/sfc/tx.c
+++ b/drivers/net/sfc/tx.c
@@ -137,8 +137,6 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 	struct pci_dev *pci_dev = efx->pci_dev;
 	struct efx_tx_buffer *buffer;
 	skb_frag_t *fragment;
-	struct page *page;
-	int page_offset;
 	unsigned int len, unmap_len = 0, fill_level, insert_ptr;
 	dma_addr_t dma_addr, unmap_addr = 0;
 	unsigned int dma_len;
@@ -241,12 +239,10 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 			break;
 		fragment = &skb_shinfo(skb)->frags[i];
 		len = fragment->size;
-		page = fragment->page;
-		page_offset = fragment->page_offset;
 		i++;
 		/* Map for DMA */
 		unmap_single = false;
-		dma_addr = pci_map_page(pci_dev, page, page_offset, len,
+		dma_addr = skb_frag_pci_map(pci_dev, fragment, 0, len,
 					PCI_DMA_TODEVICE);
 	}
 
@@ -929,8 +925,7 @@ static void tso_start(struct tso_state *st, const struct sk_buff *skb)
 static int tso_get_fragment(struct tso_state *st, struct efx_nic *efx,
 			    skb_frag_t *frag)
 {
-	st->unmap_addr = pci_map_page(efx->pci_dev, frag->page,
-				      frag->page_offset, frag->size,
+	st->unmap_addr = skb_frag_pci_map(efx->pci_dev, frag, 0, frag->size,
 				      PCI_DMA_TODEVICE);
 	if (likely(!pci_dma_mapping_error(efx->pci_dev, st->unmap_addr))) {
 		st->unmap_single = false;
diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index f4be5c7..8b80db5 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -2747,8 +2747,8 @@ static netdev_tx_t skge_xmit_frame(struct sk_buff *skb,
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-			map = pci_map_page(hw->pdev, frag->page, frag->page_offset,
-					   frag->size, PCI_DMA_TODEVICE);
+			map = skb_frag_pci_map(hw->pdev, frag, 0, frag->size,
+					       PCI_DMA_TODEVICE);
 
 			e = e->next;
 			e->skb = skb;
diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index 3ee41da..1f63f9f 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -1143,10 +1143,9 @@ static int sky2_rx_map_skb(struct pci_dev *pdev, struct rx_ring_info *re,
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		re->frag_addr[i] = pci_map_page(pdev, frag->page,
-						frag->page_offset,
-						frag->size,
-						PCI_DMA_FROMDEVICE);
+		re->frag_addr[i] = skb_frag_pci_map(pdev, frag, 0,
+						    frag->size,
+						    PCI_DMA_FROMDEVICE);
 
 		if (pci_dma_mapping_error(pdev, re->frag_addr[i]))
 			goto map_page_error;
@@ -1826,8 +1825,8 @@ static netdev_tx_t sky2_xmit_frame(struct sk_buff *skb,
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		mapping = pci_map_page(hw->pdev, frag->page, frag->page_offset,
-				       frag->size, PCI_DMA_TODEVICE);
+		mapping = skb_frag_pci_map(hw->pdev, frag, 0, frag->size,
+					   PCI_DMA_TODEVICE);
 
 		if (pci_dma_mapping_error(hw->pdev, mapping))
 			goto mapping_unwind;
@@ -2360,7 +2359,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 
 		if (length == 0) {
 			/* don't need this page */
-			__free_page(frag->page);
+			__skb_frag_unref(frag);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
diff --git a/drivers/net/starfire.c b/drivers/net/starfire.c
index 36045f3..a0c8f34 100644
--- a/drivers/net/starfire.c
+++ b/drivers/net/starfire.c
@@ -1270,7 +1270,7 @@ static netdev_tx_t start_tx(struct sk_buff *skb, struct net_device *dev)
 			skb_frag_t *this_frag = &skb_shinfo(skb)->frags[i - 1];
 			status |= this_frag->size;
 			np->tx_info[entry].mapping =
-				pci_map_single(np->pci_dev, page_address(this_frag->page) + this_frag->page_offset, this_frag->size, PCI_DMA_TODEVICE);
+				pci_map_single(np->pci_dev, skb_frag_address(this_frag), this_frag->size, PCI_DMA_TODEVICE);
 		}
 
 		np->tx_ring[entry].addr = cpu_to_dma(np->tx_info[entry].mapping);
diff --git a/drivers/net/stmmac/stmmac_main.c b/drivers/net/stmmac/stmmac_main.c
index e25e44a..5157624 100644
--- a/drivers/net/stmmac/stmmac_main.c
+++ b/drivers/net/stmmac/stmmac_main.c
@@ -1040,9 +1040,8 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		desc = priv->dma_tx + entry;
 
 		TX_DBG("\t[entry %d] segment len: %d\n", entry, len);
-		desc->des2 = dma_map_page(priv->device, frag->page,
-					  frag->page_offset,
-					  len, DMA_TO_DEVICE);
+		desc->des2 = skb_frag_dma_map(priv->device, frag, 0, len,
+					      DMA_TO_DEVICE);
 		priv->tx_skbuff[entry] = NULL;
 		priv->hw->desc->prepare_tx_desc(desc, 0, len, csum_insertion);
 		priv->hw->desc->set_tx_owner(desc);
diff --git a/drivers/net/sungem.c b/drivers/net/sungem.c
index ab59300..e3c1306 100644
--- a/drivers/net/sungem.c
+++ b/drivers/net/sungem.c
@@ -1078,10 +1078,9 @@ static netdev_tx_t gem_start_xmit(struct sk_buff *skb,
 			u64 this_ctrl;
 
 			len = this_frag->size;
-			mapping = pci_map_page(gp->pdev,
-					       this_frag->page,
-					       this_frag->page_offset,
-					       len, PCI_DMA_TODEVICE);
+			mapping = skb_frag_pci_map(gp->pdev,
+						   this_frag, 0,
+						   len, PCI_DMA_TODEVICE);
 			this_ctrl = ctrl;
 			if (frag == skb_shinfo(skb)->nr_frags - 1)
 				this_ctrl |= TXDCTRL_EOF;
diff --git a/drivers/net/sunhme.c b/drivers/net/sunhme.c
index 30aad54..3baef5e 100644
--- a/drivers/net/sunhme.c
+++ b/drivers/net/sunhme.c
@@ -2315,9 +2315,8 @@ static netdev_tx_t happy_meal_start_xmit(struct sk_buff *skb,
 			u32 len, mapping, this_txflags;
 
 			len = this_frag->size;
-			mapping = dma_map_page(hp->dma_dev, this_frag->page,
-					       this_frag->page_offset, len,
-					       DMA_TO_DEVICE);
+			mapping = skb_frag_dma_map(hp->dma_dev, this_frag,
+						   0, len, DMA_TO_DEVICE);
 			this_txflags = tx_flags;
 			if (frag == skb_shinfo(skb)->nr_frags - 1)
 				this_txflags |= TXFLAG_EOP;
diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
index 80fbee0..b817616 100644
--- a/drivers/net/tehuti.c
+++ b/drivers/net/tehuti.c
@@ -1520,8 +1520,8 @@ bdx_tx_map_skb(struct bdx_priv *priv, struct sk_buff *skb,
 		frag = &skb_shinfo(skb)->frags[i];
 		db->wptr->len = frag->size;
 		db->wptr->addr.dma =
-		    pci_map_page(priv->pdev, frag->page, frag->page_offset,
-				 frag->size, PCI_DMA_TODEVICE);
+		    skb_frag_pci_map(priv->pdev, frag, 0, frag->size,
+				     PCI_DMA_TODEVICE);
 
 		pbl++;
 		pbl->len = CPU_CHIP_SWAP32(db->wptr->len);
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index a1f9f9e..935f4eb 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -6040,10 +6040,8 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 			len = frag->size;
-			mapping = pci_map_page(tp->pdev,
-					       frag->page,
-					       frag->page_offset,
-					       len, PCI_DMA_TODEVICE);
+			mapping = skb_frag_pci_map(tp->pdev, frag, 0, len,
+						   PCI_DMA_TODEVICE);
 
 			tnapi->tx_buffers[entry].skb = NULL;
 			dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
diff --git a/drivers/net/tsi108_eth.c b/drivers/net/tsi108_eth.c
index 5c633a3..52f89a5 100644
--- a/drivers/net/tsi108_eth.c
+++ b/drivers/net/tsi108_eth.c
@@ -710,9 +710,10 @@ static int tsi108_send_packet(struct sk_buff * skb, struct net_device *dev)
 		} else {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
 
-			data->txring[tx].buf0 =
-			    dma_map_page(NULL, frag->page, frag->page_offset,
-					    frag->size, DMA_TO_DEVICE);
+			data->txring[tx].buf0 = skb_frag_dma_map(NULL, frag,
+								 0,
+								 frag->size,
+								 DMA_TO_DEVICE);
 			data->txring[tx].len = frag->size;
 		}
 
diff --git a/drivers/net/typhoon.c b/drivers/net/typhoon.c
index 3de4283..eb32147 100644
--- a/drivers/net/typhoon.c
+++ b/drivers/net/typhoon.c
@@ -819,8 +819,7 @@ typhoon_start_tx(struct sk_buff *skb, struct net_device *dev)
 			typhoon_inc_tx_index(&txRing->lastWrite, 1);
 
 			len = frag->size;
-			frag_addr = (void *) page_address(frag->page) +
-						frag->page_offset;
+			frag_addr = skb_frag_address(frag);
 			skb_dma = pci_map_single(tp->tx_pdev, frag_addr, len,
 					 PCI_DMA_TODEVICE);
 			txd->flags = TYPHOON_FRAG_DESC | TYPHOON_DESC_VALID;
diff --git a/drivers/net/via-velocity.c b/drivers/net/via-velocity.c
index 06daa9d..9c57836 100644
--- a/drivers/net/via-velocity.c
+++ b/drivers/net/via-velocity.c
@@ -2580,9 +2580,9 @@ static netdev_tx_t velocity_xmit(struct sk_buff *skb,
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		tdinfo->skb_dma[i + 1] = pci_map_page(vptr->pdev, frag->page,
-				frag->page_offset, frag->size,
-				PCI_DMA_TODEVICE);
+		tdinfo->skb_dma[i + 1] = skb_frag_pci_map(vptr->pdev, frag,
+							  0, frag->size,
+							  PCI_DMA_TODEVICE);
 
 		td_ptr->td_buf[i + 1].pa_low = cpu_to_le32(tdinfo->skb_dma[i + 1]);
 		td_ptr->td_buf[i + 1].pa_high = 0;
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f685324..c35ae8f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -137,7 +137,7 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
 	f = &skb_shinfo(skb)->frags[i];
 	f->size = min((unsigned)PAGE_SIZE - offset, *len);
 	f->page_offset = offset;
-	f->page = page;
+	__skb_frag_set_page(f, page);
 
 	skb->data_len += f->size;
 	skb->len += f->size;
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index fa6e2ac..3564169 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -650,7 +650,7 @@ vmxnet3_append_frag(struct sk_buff *skb, struct Vmxnet3_RxCompDesc *rcd,
 
 	BUG_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS);
 
-	frag->page = rbi->page;
+	__skb_frag_set_page(frag, rbi->page);
 	frag->page_offset = 0;
 	frag->size = rcd->len;
 	skb->data_len += frag->size;
@@ -744,9 +744,8 @@ vmxnet3_map_pkt(struct sk_buff *skb, struct vmxnet3_tx_ctx *ctx,
 
 		tbi = tq->buf_info + tq->tx_ring.next2fill;
 		tbi->map_type = VMXNET3_MAP_PAGE;
-		tbi->dma_addr = pci_map_page(adapter->pdev, frag->page,
-					     frag->page_offset, frag->size,
-					     PCI_DMA_TODEVICE);
+		tbi->dma_addr = skb_frag_pci_map(adapter->pdev, frag, 0,
+						 frag->size, PCI_DMA_TODEVICE);
 
 		tbi->len = frag->size;
 
diff --git a/drivers/net/vxge/vxge-main.c b/drivers/net/vxge/vxge-main.c
index 8ab870a..a802036 100644
--- a/drivers/net/vxge/vxge-main.c
+++ b/drivers/net/vxge/vxge-main.c
@@ -921,9 +921,9 @@ vxge_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (!frag->size)
 			continue;
 
-		dma_pointer = (u64) pci_map_page(fifo->pdev, frag->page,
-				frag->page_offset, frag->size,
-				PCI_DMA_TODEVICE);
+		dma_pointer = (u64) skb_frag_pci_map(fifo->pdev, frag, 0,
+						     frag->size,
+						     PCI_DMA_TODEVICE);
 
 		if (unlikely(pci_dma_mapping_error(fifo->pdev, dma_pointer)))
 			goto _exit2;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 0e4851b..5c79483 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -215,6 +215,16 @@ static int get_page_ext(struct page *pg,
 			 sizeof(struct iphdr) + MAX_IPOPTLEN + \
 			 sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
 
+static unsigned long frag_get_pending_idx(skb_frag_t *frag)
+{
+	return (unsigned long)skb_frag_page(frag);
+}
+
+static void frag_set_pending_idx(skb_frag_t *frag, unsigned long pending_idx)
+{
+	__skb_frag_set_page(frag, (void *)pending_idx);
+}
+
 static inline pending_ring_idx_t pending_index(unsigned i)
 {
 	return i & (MAX_PENDING_REQS-1);
@@ -512,7 +522,7 @@ static int netbk_gop_skb(struct sk_buff *skb,
 
 	for (i = 0; i < nr_frags; i++) {
 		netbk_gop_frag_copy(vif, skb, npo,
-				    skb_shinfo(skb)->frags[i].page,
+				    __skb_frag_page(&skb_shinfo(skb)->frags[i]),
 				    skb_shinfo(skb)->frags[i].size,
 				    skb_shinfo(skb)->frags[i].page_offset,
 				    &head);
@@ -913,7 +923,7 @@ static struct gnttab_copy *xen_netbk_get_requests(struct xen_netbk *netbk,
 	int i, start;
 
 	/* Skip first skb fragment if it is on same page as header fragment. */
-	start = ((unsigned long)shinfo->frags[0].page == pending_idx);
+	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
 
 	for (i = start; i < shinfo->nr_frags; i++, txp++) {
 		struct page *page;
@@ -945,7 +955,7 @@ static struct gnttab_copy *xen_netbk_get_requests(struct xen_netbk *netbk,
 		memcpy(&pending_tx_info[pending_idx].req, txp, sizeof(*txp));
 		xenvif_get(vif);
 		pending_tx_info[pending_idx].vif = vif;
-		frags[i].page = (void *)pending_idx;
+		frag_set_pending_idx(&frags[i], pending_idx);
 	}
 
 	return gop;
@@ -976,13 +986,13 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
 	}
 
 	/* Skip first skb fragment if it is on same page as header fragment. */
-	start = ((unsigned long)shinfo->frags[0].page == pending_idx);
+	start = (frag_get_pending_idx(&shinfo->frags[0]) == pending_idx);
 
 	for (i = start; i < nr_frags; i++) {
 		int j, newerr;
 		pending_ring_idx_t index;
 
-		pending_idx = (unsigned long)shinfo->frags[i].page;
+		pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
 
 		/* Check error status: if okay then remember grant handle. */
 		newerr = (++gop)->status;
@@ -1008,7 +1018,7 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
 		pending_idx = *((u16 *)skb->data);
 		xen_netbk_idx_release(netbk, pending_idx);
 		for (j = start; j < i; j++) {
-			pending_idx = (unsigned long)shinfo->frags[i].page;
+			pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
 			xen_netbk_idx_release(netbk, pending_idx);
 		}
 
@@ -1029,12 +1039,14 @@ static void xen_netbk_fill_frags(struct xen_netbk *netbk, struct sk_buff *skb)
 	for (i = 0; i < nr_frags; i++) {
 		skb_frag_t *frag = shinfo->frags + i;
 		struct xen_netif_tx_request *txp;
+		struct page *page;
 		unsigned long pending_idx;
 
-		pending_idx = (unsigned long)frag->page;
+		pending_idx = frag_get_pending_idx(frag);
 
 		txp = &netbk->pending_tx_info[pending_idx].req;
-		frag->page = virt_to_page(idx_to_kaddr(netbk, pending_idx));
+		page = virt_to_page(idx_to_kaddr(netbk, pending_idx));
+		__skb_frag_set_page(frag, page);
 		frag->size = txp->size;
 		frag->page_offset = txp->offset;
 
@@ -1349,11 +1361,11 @@ static unsigned xen_netbk_tx_build_gops(struct xen_netbk *netbk)
 		skb_shinfo(skb)->nr_frags = ret;
 		if (data_len < txreq.size) {
 			skb_shinfo(skb)->nr_frags++;
-			skb_shinfo(skb)->frags[0].page =
-				(void *)(unsigned long)pending_idx;
+			frag_set_pending_idx(&skb_shinfo(skb)->frags[0],
+					     pending_idx);
 		} else {
 			/* Discriminate from any valid pending_idx value. */
-			skb_shinfo(skb)->frags[0].page = (void *)~0UL;
+			frag_set_pending_idx(&skb_shinfo(skb)->frags[0], ~0UL);
 		}
 
 		__skb_queue_tail(&netbk->tx_queue, skb);
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index d29365a..ecc4b4b 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -265,7 +265,7 @@ no_skb:
 			break;
 		}
 
-		skb_shinfo(skb)->frags[0].page = page;
+		skb_frag_set_page(skb, 0, page);
 		skb_shinfo(skb)->nr_frags = 1;
 		__skb_queue_tail(&np->rx_batch, skb);
 	}
@@ -299,8 +299,8 @@ no_skb:
 		BUG_ON((signed short)ref < 0);
 		np->grant_rx_ref[id] = ref;
 
-		pfn = page_to_pfn(skb_shinfo(skb)->frags[0].page);
-		vaddr = page_address(skb_shinfo(skb)->frags[0].page);
+		pfn = page_to_pfn(skb_frag_page(&skb_shinfo(skb)->frags[0]));
+		vaddr = page_address(skb_frag_page(&skb_shinfo(skb)->frags[0]));
 
 		req = RING_GET_REQUEST(&np->rx, req_prod + i);
 		gnttab_grant_foreign_access_ref(ref,
@@ -451,7 +451,7 @@ static void xennet_make_frags(struct sk_buff *skb, struct net_device *dev,
 		ref = gnttab_claim_grant_reference(&np->gref_tx_head);
 		BUG_ON((signed short)ref < 0);
 
-		mfn = pfn_to_mfn(page_to_pfn(frag->page));
+		mfn = pfn_to_mfn(page_to_pfn(skb_frag_page(frag)));
 		gnttab_grant_foreign_access_ref(ref, np->xbdev->otherend_id,
 						mfn, GNTMAP_readonly);
 
@@ -755,8 +755,9 @@ static RING_IDX xennet_fill_frags(struct netfront_info *np,
 	while ((nskb = __skb_dequeue(list))) {
 		struct xen_netif_rx_response *rx =
 			RING_GET_RESPONSE(&np->rx, ++cons);
+		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
 
-		frag->page = skb_shinfo(nskb)->frags[0].page;
+		__skb_frag_set_page(frag, __skb_frag_page(nfrag));
 		frag->page_offset = rx->offset;
 		frag->size = rx->status;
 
@@ -858,7 +859,7 @@ static int handle_incoming_queue(struct net_device *dev,
 		memcpy(skb->data, vaddr + offset,
 		       skb_headlen(skb));
 
-		if (page != skb_shinfo(skb)->frags[0].page)
+		if (page != skb_frag_page(&skb_shinfo(skb)->frags[0]))
 			__free_page(page);
 
 		/* Ethernet work: Delayed to here as it peeks the header. */
@@ -937,7 +938,8 @@ err:
 			}
 		}
 
-		NETFRONT_SKB_CB(skb)->page = skb_shinfo(skb)->frags[0].page;
+		NETFRONT_SKB_CB(skb)->page =
+			__skb_frag_page(&skb_shinfo(skb)->frags[0]);
 		NETFRONT_SKB_CB(skb)->offset = rx->offset;
 
 		len = rx->status;
@@ -951,7 +953,7 @@ err:
 			skb_shinfo(skb)->frags[0].size = rx->status - len;
 			skb->data_len = rx->status - len;
 		} else {
-			skb_shinfo(skb)->frags[0].page = NULL;
+			skb_frag_set_page(skb, 0, NULL);
 			skb_shinfo(skb)->nr_frags = 0;
 		}
 
@@ -1094,7 +1096,8 @@ static void xennet_release_rx_bufs(struct netfront_info *np)
 
 		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 			/* Remap the page. */
-			struct page *page = skb_shinfo(skb)->frags[0].page;
+			const struct page *page =
+				skb_frag_page(&skb_shinfo(skb)->frags[0]);
 			unsigned long pfn = page_to_pfn(page);
 			void *vaddr = page_address(page);
 
@@ -1593,6 +1596,8 @@ static int xennet_connect(struct net_device *dev)
 
 	/* Step 2: Rebuild the RX buffer freelist and the RX ring itself. */
 	for (requeue_idx = 0, i = 0; i < NET_RX_RING_SIZE; i++) {
+		skb_frag_t *frag;
+		const struct page *page;
 		if (!np->rx_skbs[i])
 			continue;
 
@@ -1600,10 +1605,11 @@ static int xennet_connect(struct net_device *dev)
 		ref = np->grant_rx_ref[requeue_idx] = xennet_get_rx_ref(np, i);
 		req = RING_GET_REQUEST(&np->rx, requeue_idx);
 
+		frag = &skb_shinfo(skb)->frags[0];
+		page = skb_frag_page(frag);
 		gnttab_grant_foreign_access_ref(
 			ref, np->xbdev->otherend_id,
-			pfn_to_mfn(page_to_pfn(skb_shinfo(skb)->
-					       frags->page)),
+			pfn_to_mfn(page_to_pfn(page)),
 			0);
 		req->gref = ref;
 		req->id   = requeue_idx;
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index ab255fb..f7a3517 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -296,7 +296,7 @@ static int bnx2fc_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(frag->page, KM_SKB_DATA_SOFTIRQ)
+		cp = kmap_atomic(__skb_frag_page(frag), KM_SKB_DATA_SOFTIRQ)
 				+ frag->page_offset;
 	} else {
 		cp = (struct fcoe_crc_eof *)skb_put(skb, tlen);
diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
index a2a9c7c..949ee48 100644
--- a/drivers/scsi/cxgbi/libcxgbi.c
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -1812,7 +1812,7 @@ static int sgl_read_to_frags(struct scatterlist *sg, unsigned int sgoffset,
 
 		}
 		copy = min(datalen, sglen);
-		if (i && page == frags[i - 1].page &&
+		if (i && page == skb_frag_page(&frags[i - 1]) &&
 		    sgoffset + sg->offset ==
 			frags[i - 1].page_offset + frags[i - 1].size) {
 			frags[i - 1].size += copy;
@@ -1948,7 +1948,7 @@ int cxgbi_conn_init_pdu(struct iscsi_task *task, unsigned int offset,
 
 			/* data fits in the skb's headroom */
 			for (i = 0; i < tdata->nr_frags; i++, frag++) {
-				char *src = kmap_atomic(frag->page,
+				char *src = kmap_atomic(__skb_frag_page(frag),
 							KM_SOFTIRQ0);
 
 				memcpy(dst, src+frag->page_offset, frag->size);
@@ -1963,7 +1963,7 @@ int cxgbi_conn_init_pdu(struct iscsi_task *task, unsigned int offset,
 		} else {
 			/* data fit into frag_list */
 			for (i = 0; i < tdata->nr_frags; i++)
-				get_page(tdata->frags[i].page);
+				__skb_frag_ref(&tdata->frags[i]);
 
 			memcpy(skb_shinfo(skb)->frags, tdata->frags,
 				sizeof(skb_frag_t) * tdata->nr_frags);
diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 155d7b9..deee71a 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1425,7 +1425,7 @@ int fcoe_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(frag->page, KM_SKB_DATA_SOFTIRQ)
+		cp = kmap_atomic(__skb_frag_page(frag), KM_SKB_DATA_SOFTIRQ)
 			+ frag->page_offset;
 	} else {
 		cp = (struct fcoe_crc_eof *)skb_put(skb, tlen);
diff --git a/drivers/scsi/fcoe/fcoe_transport.c b/drivers/scsi/fcoe/fcoe_transport.c
index 41068e8..40243ce 100644
--- a/drivers/scsi/fcoe/fcoe_transport.c
+++ b/drivers/scsi/fcoe/fcoe_transport.c
@@ -108,8 +108,9 @@ u32 fcoe_fc_crc(struct fc_frame *fp)
 		len = frag->size;
 		while (len > 0) {
 			clen = min(len, PAGE_SIZE - (off & ~PAGE_MASK));
-			data = kmap_atomic(frag->page + (off >> PAGE_SHIFT),
-					   KM_SKB_DATA_SOFTIRQ);
+			data = kmap_atomic(
+				__skb_frag_page(frag) + (off >> PAGE_SHIFT),
+				KM_SKB_DATA_SOFTIRQ);
 			crc = crc32(crc, data + (off & ~PAGE_MASK), clen);
 			kunmap_atomic(data, KM_SKB_DATA_SOFTIRQ);
 			off += clen;
diff --git a/drivers/staging/et131x/et1310_tx.c b/drivers/staging/et131x/et1310_tx.c
index 4241d2a..b2ab240 100644
--- a/drivers/staging/et131x/et1310_tx.c
+++ b/drivers/staging/et131x/et1310_tx.c
@@ -519,12 +519,11 @@ static int nic_send_packet(struct et131x_adapter *etdev, struct tcb *tcb)
 			 * returned by pci_map_page() is always 32-bit
 			 * addressable (as defined by the pci/dma subsystem)
 			 */
-			desc[frag++].addr_lo =
-			    pci_map_page(etdev->pdev,
-					 frags[i - 1].page,
-					 frags[i - 1].page_offset,
-					 frags[i - 1].size,
-					 PCI_DMA_TODEVICE);
+			desc[frag++].addr_lo = skb_frag_pci_map(etdev->pdev,
+								&frags[i - 1],
+								0,
+								frags[i - 1].size,
+								PCI_DMA_TODEVICE);
 		}
 	}
 
diff --git a/drivers/staging/hv/netvsc_drv.c b/drivers/staging/hv/netvsc_drv.c
index 7b9c229..80d1f1f 100644
--- a/drivers/staging/hv/netvsc_drv.c
+++ b/drivers/staging/hv/netvsc_drv.c
@@ -172,7 +172,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *net)
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-		packet->page_buf[i+2].pfn = page_to_pfn(f->page);
+		packet->page_buf[i+2].pfn = page_to_pfn(skb_frag_page(f));
 		packet->page_buf[i+2].offset = f->page_offset;
 		packet->page_buf[i+2].len = f->size;
 	}
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 07/10] net: add support for per-paged-fragment destructors
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (5 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 06/10] net: convert drivers to paged frag API Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 08/10] net: add paged frag destructor support to kernel_sendpage Ian Campbell
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

Entities which care about the complete lifecycle of pages which they inject
into the network stack via an skb paged fragment can choose to set this
destructor in order to receive a callback when the stack is really finished
with a page (including all clones, retransmits, pull-ups, etc.).

This destructor will always be propagated alongside the struct page when
copying skb_frag_t->page. This is the reason I chose to embed the destructor in
a "struct { } page" within the skb_frag_t, rather than as a separate field,
since it allows existing code which propagates ->frags[N].page to Just
Work(tm).

When the destructor is present the page reference counting is done slightly
differently. No references are held by the network stack on the struct page (it
is up to the caller to manage this as necessary) instead the network stack will
track references via the count embedded in the destructor structure. When this
reference count reaches zero then the destructor will be called and the caller
can take the necesary steps to release the page (i.e. release the struct page
reference itself).

The intention is that callers can use this callback to delay completion to
_their_ callers until the network stack has completely released the page, in
order to prevent use-after-free or modification of data pages which are still
in use by the stack.

It is allowable (indeed expected) for a caller to share a single destructor
instance between multiple pages injected into the stack, e.g. a group of pages
included in a single higher level operation might share a destructor which is
used to complete that higher level operation.

NB: a small number of drivers use skb_frag_t independently of struct sk_buff so
this patch is slightly larger than necessary. I did consider leaving skb_frag_t
alone and defining a new (but similar) structure to be used in the struct
sk_buff itself. This would also have the advantage of more clearly separating
the two uses, which is useful since there are now special reference counting
accessors for skb_frag_t within a struct sk_buff but not (necessarily) for
those used outside of an skb.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/cxgb4/sge.c       |   14 +++++++-------
 drivers/net/cxgb4vf/sge.c     |   18 +++++++++---------
 drivers/net/mlx4/en_rx.c      |    2 +-
 drivers/scsi/cxgbi/libcxgbi.c |    2 +-
 include/linux/skbuff.h        |   31 ++++++++++++++++++++++++++-----
 net/core/skbuff.c             |   17 +++++++++++++++++
 6 files changed, 61 insertions(+), 23 deletions(-)

diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
index f1813b5..3e7c4b3 100644
--- a/drivers/net/cxgb4/sge.c
+++ b/drivers/net/cxgb4/sge.c
@@ -1416,7 +1416,7 @@ static inline void copy_frags(struct sk_buff *skb,
 	unsigned int n;
 
 	/* usually there's just one frag */
-	skb_frag_set_page(skb, 0, gl->frags[0].page);
+	skb_frag_set_page(skb, 0, gl->frags[0].page.p);	/* XXX */
 	ssi->frags[0].page_offset = gl->frags[0].page_offset + offset;
 	ssi->frags[0].size = gl->frags[0].size - offset;
 	ssi->nr_frags = gl->nfrags;
@@ -1425,7 +1425,7 @@ static inline void copy_frags(struct sk_buff *skb,
 		memcpy(&ssi->frags[1], &gl->frags[1], n * sizeof(skb_frag_t));
 
 	/* get a reference to the last page, we don't own it */
-	get_page(gl->frags[n].page);
+	get_page(gl->frags[n].page.p);	/* XXX */
 }
 
 /**
@@ -1482,7 +1482,7 @@ static void t4_pktgl_free(const struct pkt_gl *gl)
 	const skb_frag_t *p;
 
 	for (p = gl->frags, n = gl->nfrags - 1; n--; p++)
-		put_page(p->page);
+		put_page(p->page.p); /* XXX */
 }
 
 /*
@@ -1635,7 +1635,7 @@ static void restore_rx_bufs(const struct pkt_gl *si, struct sge_fl *q,
 		else
 			q->cidx--;
 		d = &q->sdesc[q->cidx];
-		d->page = si->frags[frags].page;
+		d->page = si->frags[frags].page.p; /* XXX */
 		d->dma_addr |= RX_UNMAPPED_BUF;
 		q->avail++;
 	}
@@ -1717,7 +1717,7 @@ static int process_responses(struct sge_rspq *q, int budget)
 			for (frags = 0, fp = si.frags; ; frags++, fp++) {
 				rsd = &rxq->fl.sdesc[rxq->fl.cidx];
 				bufsz = get_buf_size(rsd);
-				fp->page = rsd->page;
+				fp->page.p = rsd->page; /* XXX */
 				fp->page_offset = q->offset;
 				fp->size = min(bufsz, len);
 				len -= fp->size;
@@ -1734,8 +1734,8 @@ static int process_responses(struct sge_rspq *q, int budget)
 						get_buf_addr(rsd),
 						fp->size, DMA_FROM_DEVICE);
 
-			si.va = page_address(si.frags[0].page) +
-				si.frags[0].page_offset;
+			si.va = page_address(si.frags[0].page.p) +
+				si.frags[0].page_offset; /* XXX */
 
 			prefetch(si.va);
 
diff --git a/drivers/net/cxgb4vf/sge.c b/drivers/net/cxgb4vf/sge.c
index f4c4480..0a0dda1 100644
--- a/drivers/net/cxgb4vf/sge.c
+++ b/drivers/net/cxgb4vf/sge.c
@@ -1397,7 +1397,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
 		skb_copy_to_linear_data(skb, gl->va, pull_len);
 
 		ssi = skb_shinfo(skb);
-		skb_frag_set_page(skb, 0, gl->frags[0].page);
+		skb_frag_set_page(skb, 0, gl->frags[0].page.p); /* XXX */
 		ssi->frags[0].page_offset = gl->frags[0].page_offset + pull_len;
 		ssi->frags[0].size = gl->frags[0].size - pull_len;
 		if (gl->nfrags > 1)
@@ -1410,7 +1410,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
 		skb->truesize += skb->data_len;
 
 		/* Get a reference for the last page, we don't own it */
-		get_page(gl->frags[gl->nfrags - 1].page);
+		get_page(gl->frags[gl->nfrags - 1].page.p); /* XXX */
 	}
 
 out:
@@ -1430,7 +1430,7 @@ void t4vf_pktgl_free(const struct pkt_gl *gl)
 
 	frag = gl->nfrags - 1;
 	while (frag--)
-		put_page(gl->frags[frag].page);
+		put_page(gl->frags[frag].page.p); /* XXX */
 }
 
 /**
@@ -1450,7 +1450,7 @@ static inline void copy_frags(struct sk_buff *skb,
 	unsigned int n;
 
 	/* usually there's just one frag */
-	skb_frag_set_page(skb, 0, gl->frags[0].page);
+	skb_frag_set_page(skb, 0, gl->frags[0].page.p);	/* XXX */
 	si->frags[0].page_offset = gl->frags[0].page_offset + offset;
 	si->frags[0].size = gl->frags[0].size - offset;
 	si->nr_frags = gl->nfrags;
@@ -1460,7 +1460,7 @@ static inline void copy_frags(struct sk_buff *skb,
 		memcpy(&si->frags[1], &gl->frags[1], n * sizeof(skb_frag_t));
 
 	/* get a reference to the last page, we don't own it */
-	get_page(gl->frags[n].page);
+	get_page(gl->frags[n].page.p); /* XXX */
 }
 
 /**
@@ -1633,7 +1633,7 @@ static void restore_rx_bufs(const struct pkt_gl *gl, struct sge_fl *fl,
 		else
 			fl->cidx--;
 		sdesc = &fl->sdesc[fl->cidx];
-		sdesc->page = gl->frags[frags].page;
+		sdesc->page = gl->frags[frags].page.p; /* XXX */
 		sdesc->dma_addr |= RX_UNMAPPED_BUF;
 		fl->avail++;
 	}
@@ -1721,7 +1721,7 @@ int process_responses(struct sge_rspq *rspq, int budget)
 				BUG_ON(rxq->fl.avail == 0);
 				sdesc = &rxq->fl.sdesc[rxq->fl.cidx];
 				bufsz = get_buf_size(sdesc);
-				fp->page = sdesc->page;
+				fp->page.p = sdesc->page; /* XXX */
 				fp->page_offset = rspq->offset;
 				fp->size = min(bufsz, len);
 				len -= fp->size;
@@ -1739,8 +1739,8 @@ int process_responses(struct sge_rspq *rspq, int budget)
 			dma_sync_single_for_cpu(rspq->adapter->pdev_dev,
 						get_buf_addr(sdesc),
 						fp->size, DMA_FROM_DEVICE);
-			gl.va = (page_address(gl.frags[0].page) +
-				 gl.frags[0].page_offset);
+			gl.va = (page_address(gl.frags[0].page.p) +
+				 gl.frags[0].page_offset); /* XXX */
 			prefetch(gl.va);
 
 			/*
diff --git a/drivers/net/mlx4/en_rx.c b/drivers/net/mlx4/en_rx.c
index 21a89e0..c5d01ce 100644
--- a/drivers/net/mlx4/en_rx.c
+++ b/drivers/net/mlx4/en_rx.c
@@ -418,7 +418,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 			break;
 
 		/* Save page reference in skb */
-		__skb_frag_set_page(&skb_frags_rx[nr], skb_frags[nr].page);
+		__skb_frag_set_page(&skb_frags_rx[nr], skb_frags[nr].page.p); /* XXX */
 		skb_frags_rx[nr].size = skb_frags[nr].size;
 		skb_frags_rx[nr].page_offset = skb_frags[nr].page_offset;
 		dma = be64_to_cpu(rx_desc->data[nr].addr);
diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
index 949ee48..8d16a74 100644
--- a/drivers/scsi/cxgbi/libcxgbi.c
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -1823,7 +1823,7 @@ static int sgl_read_to_frags(struct scatterlist *sg, unsigned int sgoffset,
 				return -EINVAL;
 			}
 
-			frags[i].page = page;
+			frags[i].page.p = page;
 			frags[i].page_offset = sg->offset + sgoffset;
 			frags[i].size = copy;
 			i++;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 982c6a3..99b56e3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -135,8 +135,17 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(void *data);
+	void *data;
+};
+
 struct skb_frag_struct {
-	struct page *page;
+	struct {
+		struct page *p;
+		struct skb_frag_destructor *destructor;
+	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
 	__u32 size;
@@ -1129,7 +1138,8 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 {
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-	frag->page		  = page;
+	frag->page.p		  = page;
+	frag->page.destructor     = NULL;
 	frag->page_offset	  = off;
 	frag->size		  = size;
 }
@@ -1648,7 +1658,7 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
  */
 static inline struct page *__skb_frag_page(const skb_frag_t *frag)
 {
-	return frag->page;
+	return frag->page.p;
 }
 
 /**
@@ -1659,9 +1669,12 @@ static inline struct page *__skb_frag_page(const skb_frag_t *frag)
  */
 static inline const struct page *skb_frag_page(const skb_frag_t *frag)
 {
-	return frag->page;
+	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
  * @frag: the paged fragment
@@ -1670,6 +1683,10 @@ static inline const struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_ref(frag->page.destructor);
+		return;
+	}
 	get_page(__skb_frag_page(frag));
 }
 
@@ -1693,6 +1710,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
  */
 static inline void __skb_frag_unref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_unref(frag->page.destructor);
+		return;
+	}
 	put_page(__skb_frag_page(frag));
 }
 
@@ -1745,7 +1766,7 @@ static inline void *skb_frag_address_safe(const skb_frag_t *frag)
  */
 static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page)
 {
-	frag->page = page;
+	frag->page.p = page;
 	__skb_frag_ref(frag);
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2133600..bdc6f6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -292,6 +292,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
 }
 EXPORT_SYMBOL(dev_alloc_skb);
 
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+	BUG_ON(destroy == NULL);
+	atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+	if (destroy == NULL)
+		return;
+
+	if (atomic_dec_and_test(&destroy->ref))
+		destroy->destroy(destroy->data);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 08/10] net: add paged frag destructor support to kernel_sendpage.
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (6 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 07/10] net: add support for per-paged-fragment destructors Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 11:07 ` [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack Ian Campbell
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

NB: I added a separate sendpage_destructor to struct proto_ops and struct proto
for this PoC but expect that it would be preferable to just add the new
parameter and update all callers in the tree for the final version.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/staging/pohmelfs/trans.c |    2 +-
 fs/dlm/lowcomms.c                |    2 +-
 include/linux/net.h              |    9 ++++++++-
 include/net/inet_common.h        |    4 ++++
 include/net/sock.h               |    4 ++++
 include/net/tcp.h                |    4 ++++
 net/ceph/messenger.c             |    2 +-
 net/ipv4/af_inet.c               |   20 ++++++++++++++++++--
 net/ipv4/tcp.c                   |   29 ++++++++++++++++++++++++-----
 net/ipv4/tcp_ipv4.c              |    1 +
 net/ipv6/af_inet6.c              |    1 +
 net/ipv6/tcp_ipv6.c              |    1 +
 net/socket.c                     |   28 +++++++++++++++++++++++-----
 net/sunrpc/svcsock.c             |    6 +++---
 14 files changed, 94 insertions(+), 19 deletions(-)

diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index 36a2535..b5d8411 100644
--- a/drivers/staging/pohmelfs/trans.c
+++ b/drivers/staging/pohmelfs/trans.c
@@ -104,7 +104,7 @@ static int netfs_trans_send_pages(struct netfs_trans *t, struct netfs_state *st)
 		msg.msg_flags = MSG_WAITALL | (attached_pages == 1 ? 0 :
 				MSG_MORE);
 
-		err = kernel_sendpage(st->socket, page, 0, size, msg.msg_flags);
+		err = kernel_sendpage(st->socket, page, NULL, 0, size, msg.msg_flags);
 		if (err <= 0) {
 			printk("%s: %d/%d failed to send transaction page: t: %p, gen: %u, size: %u, err: %d.\n",
 					__func__, i, t->page_num, t, t->gen, size, err);
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 5e2c71f..64933ff 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1341,7 +1341,7 @@ static void send_to_sock(struct connection *con)
 
 		ret = 0;
 		if (len) {
-			ret = kernel_sendpage(con->sock, e->page, offset, len,
+			ret = kernel_sendpage(con->sock, e->page, NULL, offset, len,
 					      msg_flags);
 			if (ret == -EAGAIN || ret == 0) {
 				if (ret == -EAGAIN &&
diff --git a/include/linux/net.h b/include/linux/net.h
index b299230..dfedc46 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct skb_frag_destructor;
 
 struct proto_ops {
 	int		family;
@@ -204,6 +205,10 @@ struct proto_ops {
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
 				      int offset, size_t size, int flags);
+	ssize_t		(*sendpage_destructor)
+				     (struct socket *sock, struct page *page,
+				      struct skb_frag_destructor *destroy,
+				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 };
@@ -273,7 +278,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
 			     char *optval, int *optlen);
 extern int kernel_setsockopt(struct socket *sock, int level, int optname,
 			     char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+			   struct skb_frag_destructor *destroy,
+			   int offset,
 			   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..0c39b4b 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -23,6 +23,10 @@ extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size);
 extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 			     size_t size, int flags);
+extern ssize_t inet_sendpage_destructor(struct socket *sock, struct page *page,
+			     struct skb_frag_destructor *frag,
+			     int offset,
+			     size_t size, int flags);
 extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size, int flags);
 extern int inet_shutdown(struct socket *sock, int how);
diff --git a/include/net/sock.h b/include/net/sock.h
index c0b938c..7836acf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -764,6 +764,10 @@ struct proto {
 					int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
 					int offset, size_t size, int flags);
+	int			(*sendpage_destructor)
+				       (struct sock *sk, struct page *page,
+					struct skb_frag_destructor *destroy,
+					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
 					struct sockaddr *uaddr, int addr_len);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..2b42320 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -319,6 +319,10 @@ extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t size);
 extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
+extern int tcp_sendpage_destructor(struct sock *sk, struct page *page,
+			struct skb_frag_destructor *destroy,
+			int offset,
+			size_t size, int flags);
 extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 				 struct tcphdr *th, unsigned len);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 78b55f4..ec7955b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -852,7 +852,7 @@ static int write_partial_msg_pages(struct ceph_connection *con)
 				cpu_to_le32(crc32c(tmpcrc, base, len));
 			con->out_msg_pos.did_page_crc = 1;
 		}
-		ret = kernel_sendpage(con->sock, page,
+		ret = kernel_sendpage(con->sock, page, NULL,
 				      con->out_msg_pos.page_pos + page_shift,
 				      len,
 				      MSG_DONTWAIT | MSG_NOSIGNAL |
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eae1f67..7954809 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -738,7 +738,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage_destructor(struct socket *sock, struct page *page,
+		      struct skb_frag_destructor *destroy,
+		      int offset,
 		      size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
@@ -750,10 +752,21 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 	    inet_autobind(sk))
 		return -EAGAIN;
 
-	if (sk->sk_prot->sendpage)
+	if (destroy) {
+		if (sk->sk_prot->sendpage_destructor)
+			return sk->sk_prot->sendpage_destructor
+				(sk, page, destroy, offset, size, flags);
+	} else if (sk->sk_prot->sendpage)
 		return sk->sk_prot->sendpage(sk, page, offset, size, flags);
 	return sock_no_sendpage(sock, page, offset, size, flags);
 }
+EXPORT_SYMBOL(inet_sendpage_destructor);
+
+ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+		      size_t size, int flags)
+{
+	return inet_sendpage_destructor(sock, page, NULL, offset, size, flags);
+}
 EXPORT_SYMBOL(inet_sendpage);
 
 int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
@@ -917,6 +930,7 @@ const struct proto_ops inet_stream_ops = {
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
+	.sendpage_destructor = inet_sendpage_destructor,
 	.splice_read	   = tcp_splice_read,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
@@ -945,6 +959,7 @@ const struct proto_ops inet_dgram_ops = {
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
+	.sendpage_destructor = inet_sendpage_destructor,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
 	.compat_getsockopt = compat_sock_common_getsockopt,
@@ -976,6 +991,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
+	.sendpage_destructor = inet_sendpage_destructor,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
 	.compat_getsockopt = compat_sock_common_getsockopt,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3a3703c..bfc778e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -757,7 +757,10 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
+static ssize_t do_tcp_sendpages(struct sock *sk,
+				struct page **pages,
+				struct skb_frag_destructor **destructors,
+				int poffset,
 			 size_t psize, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -783,6 +786,7 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	while (psize > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
 		struct page *page = pages[poffset / PAGE_SIZE];
+		struct skb_frag_destructor *destructor = destructors ? destructors[poffset / PAGE_SIZE] : NULL;
 		int copy, i, can_coalesce;
 		int offset = poffset % PAGE_SIZE;
 		int size = min_t(size_t, psize, PAGE_SIZE - offset);
@@ -815,8 +819,9 @@ new_segment:
 		if (can_coalesce) {
 			skb_shinfo(skb)->frags[i - 1].size += copy;
 		} else {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
+			skb_shinfo(skb)->frags[i].page.destructor = destructor;
+			skb_frag_ref(skb, i);
 		}
 
 		skb->len += copy;
@@ -871,8 +876,11 @@ out_err:
 	return sk_stream_error(sk, flags, err);
 }
 
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int tcp_sendpage_destructor(struct sock *sk,
+			    struct page *page,
+			    struct skb_frag_destructor *destructor,
+			    int offset,
+			    size_t size, int flags)
 {
 	ssize_t res;
 
@@ -882,10 +890,21 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 					flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, &page,
+			       destructor ? &destructor : NULL,
+			       offset, size, flags);
 	release_sock(sk);
 	return res;
 }
+EXPORT_SYMBOL(tcp_sendpage_destructor);
+
+int tcp_sendpage(struct sock *sk,
+		 struct page *page,
+		 int offset,
+		 size_t size, int flags)
+{
+	return tcp_sendpage_destructor(sk, page, NULL, offset, size, flags);
+}
 EXPORT_SYMBOL(tcp_sendpage);
 
 #define TCP_PAGE(sk)	(sk->sk_sndmsg_page)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 708dc20..9baa996 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2587,6 +2587,7 @@ struct proto tcp_prot = {
 	.recvmsg		= tcp_recvmsg,
 	.sendmsg		= tcp_sendmsg,
 	.sendpage		= tcp_sendpage,
+	.sendpage_destructor	= tcp_sendpage_destructor,
 	.backlog_rcv		= tcp_v4_do_rcv,
 	.hash			= inet_hash,
 	.unhash			= inet_unhash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d450a2f..58d2520 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -531,6 +531,7 @@ const struct proto_ops inet6_stream_ops = {
 	.recvmsg	   = inet_recvmsg,		/* ok		*/
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
+	.sendpage_destructor = inet_sendpage_destructor,
 	.splice_read	   = tcp_splice_read,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 87551ca..98a2576 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2210,6 +2210,7 @@ struct proto tcpv6_prot = {
 	.recvmsg		= tcp_recvmsg,
 	.sendmsg		= tcp_sendmsg,
 	.sendpage		= tcp_sendpage,
+	.sendpage_destructor	= tcp_sendpage_destructor,
 	.backlog_rcv		= tcp_v6_do_rcv,
 	.hash			= tcp_v6_hash,
 	.unhash			= inet_unhash,
diff --git a/net/socket.c b/net/socket.c
index 02dc82d..f1c39a4 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -795,7 +795,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	if (more)
 		flags |= MSG_MORE;
 
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return kernel_sendpage(sock, page, NULL, offset, size, flags);
 }
 
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3343,15 +3343,33 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(kernel_setsockopt);
 
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+		    struct skb_frag_destructor *destroy,
+		    int offset,
 		    size_t size, int flags)
 {
+	int ret;
 	sock_update_classid(sock->sk);
 
-	if (sock->ops->sendpage)
-		return sock->ops->sendpage(sock, page, offset, size, flags);
+	/*
+	 * If we have a destructor but the socket does not support
+	 * sendpage_destructor then fall back to sock_no_sendpage,
+	 * which copies.
+	 */
+	if (destroy) {
+		if (sock->ops->sendpage_destructor)
+			return sock->ops->sendpage_destructor(sock, page, destroy,
+							      offset, size, flags);
+	} else {
+		if (sock->ops->sendpage)
+			return sock->ops->sendpage(sock, page,
+						   offset, size, flags);
+	}
 
-	return sock_no_sendpage(sock, page, offset, size, flags);
+	ret = sock_no_sendpage(sock, page, offset, size, flags);
+	/* sock_no_sendpage copies so we can destroy immediately */
+	skb_frag_destructor_unref(destroy);
+	return ret;
 }
 EXPORT_SYMBOL(kernel_sendpage);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index af04f77..a80b1d3 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -181,7 +181,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	/* send head */
 	if (slen == xdr->head[0].iov_len)
 		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
+	len = kernel_sendpage(sock, headpage, NULL, headoffset,
 				  xdr->head[0].iov_len, flags);
 	if (len != xdr->head[0].iov_len)
 		goto out;
@@ -194,7 +194,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
@@ -208,7 +208,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 
 	/* send tail */
 	if (xdr->tail[0].iov_len) {
-		result = kernel_sendpage(sock, tailpage, tailoffset,
+		result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
 				   xdr->tail[0].iov_len, 0);
 		if (result > 0)
 			len += result;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread
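The dispatch-and-fallback logic that this patch adds to kernel_sendpage() can be modelled in plain userspace C. This is only a sketch: the structures and names below are simplified stand-ins for the kernel types, not real API. The key behaviour is that the copying fallback (sock_no_sendpage() in the patch) releases the caller's destructor reference immediately, since the data has already been duplicated:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct skb_frag_destructor. */
struct frag_destructor {
	int ref;                      /* models atomic_t ref */
	void (*destroy)(void *data);
	void *data;
};

static void destructor_unref(struct frag_destructor *d)
{
	if (d == NULL)
		return;               /* NULL destructor is a safe no-op */
	if (--d->ref == 0)
		d->destroy(d->data);
}

/* Stand-in for the relevant members of struct proto_ops. */
struct sock_ops {
	int (*sendpage)(const char *page, size_t size);
	int (*sendpage_destructor)(const char *page,
				   struct frag_destructor *d, size_t size);
};

/* Copying fallback: the data is duplicated, so the caller's page
 * is no longer needed once this returns. */
static int no_sendpage(const char *page, size_t size)
{
	char copy[256];
	memcpy(copy, page, size < sizeof(copy) ? size : sizeof(copy));
	return (int)size;
}

static int model_kernel_sendpage(const struct sock_ops *ops, const char *page,
				 struct frag_destructor *d, size_t size)
{
	int ret;

	if (d) {
		if (ops->sendpage_destructor)
			return ops->sendpage_destructor(page, d, size);
	} else {
		if (ops->sendpage)
			return ops->sendpage(page, size);
	}

	/* Fallback copies, so drop the caller's reference at once. */
	ret = no_sendpage(page, size);
	destructor_unref(d);
	return ret;
}
```

The NULL check in destructor_unref() mirrors the patch's skb_frag_destructor_unref(), which is what lets the fallback path call it unconditionally.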

* [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack.
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (7 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 08/10] net: add paged frag destructor support to kernel_sendpage Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 14:01   ` Trond Myklebust
  2011-07-15 11:07 ` [PATCH 10/10] nfs: debugging for nfs destructor Ian Campbell
  2011-07-15 15:17 ` [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility David Miller
  10 siblings, 1 reply; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens then the write() system call and the userspace process
can continue potentially modifying the data before the retransmission occurs.

NB: this only covers the O_DIRECT write() case. I expect other cases to need
handling as well.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 fs/nfs/direct.c            |   17 +++++++++++++++--
 fs/nfs/nfs2xdr.c           |    4 ++--
 fs/nfs/nfs3xdr.c           |    7 ++++---
 fs/nfs/nfs4xdr.c           |    6 +++---
 include/linux/nfs_xdr.h    |    4 ++++
 include/linux/sunrpc/xdr.h |    2 ++
 net/sunrpc/svcsock.c       |    2 +-
 net/sunrpc/xdr.c           |    4 +++-
 net/sunrpc/xprtsock.c      |    2 +-
 9 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 8eea253..4735fd9 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -691,8 +691,7 @@ static void nfs_direct_write_release(void *calldata)
 out_unlock:
 	spin_unlock(&dreq->lock);
 
-	if (put_dreq(dreq))
-		nfs_direct_write_complete(dreq, data->inode);
+	skb_frag_destructor_unref(data->args.pages_destructor);
 }
 
 static const struct rpc_call_ops nfs_write_direct_ops = {
@@ -703,6 +702,15 @@ static const struct rpc_call_ops nfs_write_direct_ops = {
 	.rpc_release = nfs_direct_write_release,
 };
 
+static int nfs_write_page_destroy(void *calldata)
+{
+	struct nfs_write_data *data = calldata;
+	struct nfs_direct_req *dreq = (struct nfs_direct_req *) data->req;
+	if (put_dreq(dreq))
+		nfs_direct_write_complete(dreq, data->inode);
+	return 0;
+}
+
 /*
  * For each wsize'd chunk of the user's buffer, dispatch an NFS WRITE
  * operation.  If nfs_writedata_alloc() or get_user_pages() fails,
@@ -769,6 +777,10 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 
 		list_move_tail(&data->pages, &dreq->rewrite_list);
 
+		atomic_set(&data->pagevec_destructor.ref, 1);
+		data->pagevec_destructor.destroy = nfs_write_page_destroy;
+		data->pagevec_destructor.data = data;
+
 		data->req = (struct nfs_page *) dreq;
 		data->inode = inode;
 		data->cred = msg.rpc_cred;
@@ -778,6 +790,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 		data->args.offset = pos;
 		data->args.pgbase = pgbase;
 		data->args.pages = data->pagevec;
+		data->args.pages_destructor = &data->pagevec_destructor;
 		data->args.count = bytes;
 		data->args.stable = sync;
 		data->res.fattr = &data->fattr;
diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
index 792cb13..6dc77f0 100644
--- a/fs/nfs/nfs2xdr.c
+++ b/fs/nfs/nfs2xdr.c
@@ -431,7 +431,7 @@ static void encode_path(struct xdr_stream *xdr, struct page **pages, u32 length)
 	BUG_ON(length > NFS2_MAXPATHLEN);
 	p = xdr_reserve_space(xdr, 4);
 	*p = cpu_to_be32(length);
-	xdr_write_pages(xdr, pages, 0, length);
+	xdr_write_pages(xdr, pages, NULL, 0, length);
 }
 
 static int decode_path(struct xdr_stream *xdr)
@@ -659,7 +659,7 @@ static void encode_writeargs(struct xdr_stream *xdr,
 
 	/* nfsdata */
 	*p = cpu_to_be32(count);
-	xdr_write_pages(xdr, args->pages, args->pgbase, count);
+	xdr_write_pages(xdr, args->pages, NULL, args->pgbase, count);
 }
 
 static void nfs2_xdr_enc_writeargs(struct rpc_rqst *req,
diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index 183c6b1..f7a83a1 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -238,7 +238,7 @@ static void encode_nfspath3(struct xdr_stream *xdr, struct page **pages,
 {
 	BUG_ON(length > NFS3_MAXPATHLEN);
 	encode_uint32(xdr, length);
-	xdr_write_pages(xdr, pages, 0, length);
+	xdr_write_pages(xdr, pages, NULL, 0, length);
 }
 
 static int decode_nfspath3(struct xdr_stream *xdr)
@@ -994,7 +994,8 @@ static void encode_write3args(struct xdr_stream *xdr,
 	*p++ = cpu_to_be32(args->count);
 	*p++ = cpu_to_be32(args->stable);
 	*p = cpu_to_be32(args->count);
-	xdr_write_pages(xdr, args->pages, args->pgbase, args->count);
+	xdr_write_pages(xdr, args->pages, args->pages_destructor,
+			args->pgbase, args->count);
 }
 
 static void nfs3_xdr_enc_write3args(struct rpc_rqst *req,
@@ -1331,7 +1332,7 @@ static void nfs3_xdr_enc_setacl3args(struct rpc_rqst *req,
 
 	base = req->rq_slen;
 	if (args->npages != 0)
-		xdr_write_pages(xdr, args->pages, 0, args->len);
+		xdr_write_pages(xdr, args->pages, NULL, 0, args->len);
 	else
 		xdr_reserve_space(xdr, NFS_ACL_INLINE_BUFSIZE);
 
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 6870bc6..ac9931d 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1031,7 +1031,7 @@ static void encode_create(struct xdr_stream *xdr, const struct nfs4_create_arg *
 	case NF4LNK:
 		p = reserve_space(xdr, 4);
 		*p = cpu_to_be32(create->u.symlink.len);
-		xdr_write_pages(xdr, create->u.symlink.pages, 0, create->u.symlink.len);
+		xdr_write_pages(xdr, create->u.symlink.pages, NULL, 0, create->u.symlink.len);
 		break;
 
 	case NF4BLK: case NF4CHR:
@@ -1573,7 +1573,7 @@ encode_setacl(struct xdr_stream *xdr, struct nfs_setaclargs *arg, struct compoun
 	BUG_ON(arg->acl_len % 4);
 	p = reserve_space(xdr, 4);
 	*p = cpu_to_be32(arg->acl_len);
-	xdr_write_pages(xdr, arg->acl_pages, arg->acl_pgbase, arg->acl_len);
+	xdr_write_pages(xdr, arg->acl_pages, NULL, arg->acl_pgbase, arg->acl_len);
 	hdr->nops++;
 	hdr->replen += decode_setacl_maxsz;
 }
@@ -1647,7 +1647,7 @@ static void encode_write(struct xdr_stream *xdr, const struct nfs_writeargs *arg
 	*p++ = cpu_to_be32(args->stable);
 	*p = cpu_to_be32(args->count);
 
-	xdr_write_pages(xdr, args->pages, args->pgbase, args->count);
+	xdr_write_pages(xdr, args->pages, NULL, args->pgbase, args->count);
 	hdr->nops++;
 	hdr->replen += decode_write_maxsz;
 }
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 00848d8..4bbc2bf 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -5,6 +5,8 @@
 #include <linux/nfs3.h>
 #include <linux/sunrpc/gss_api.h>
 
+#include <linux/skbuff.h>
+
 /*
  * To change the maximum rsize and wsize supported by the NFS client, adjust
  * NFS_MAX_FILE_IO_SIZE.  64KB is a typical maximum, but some servers can
@@ -470,6 +472,7 @@ struct nfs_writeargs {
 	enum nfs3_stable_how	stable;
 	unsigned int		pgbase;
 	struct page **		pages;
+	struct skb_frag_destructor *pages_destructor;
 	const u32 *		bitmask;
 	struct nfs4_sequence_args	seq_args;
 };
@@ -1121,6 +1124,7 @@ struct nfs_write_data {
 	struct list_head	pages;		/* Coalesced requests we wish to flush */
 	struct nfs_page		*req;		/* multi ops per nfs_page */
 	struct page		**pagevec;
+	struct skb_frag_destructor pagevec_destructor;
 	unsigned int		npages;		/* Max length of pagevec */
 	struct nfs_writeargs	args;		/* argument struct */
 	struct nfs_writeres	res;		/* result struct */
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index a20970e..cebb531 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -57,6 +57,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
@@ -214,6 +215,7 @@ typedef int	(*kxdrdproc_t)(void *rqstp, struct xdr_stream *xdr, void *obj);
 extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p);
 extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes);
 extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages,
+		struct skb_frag_destructor *destroy,
 		unsigned int base, unsigned int len);
 extern void xdr_init_decode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p);
 extern void xdr_init_decode_pages(struct xdr_stream *xdr, struct xdr_buf *buf,
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a80b1d3..40c2420 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -194,7 +194,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index f008c14..9c7dded 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -525,12 +525,14 @@ EXPORT_SYMBOL_GPL(xdr_reserve_space);
  * @len: length of data in bytes
  *
  */
-void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, unsigned int base,
+void xdr_write_pages(struct xdr_stream *xdr, struct page **pages,
+		 struct skb_frag_destructor *destroy, unsigned int base,
 		 unsigned int len)
 {
 	struct xdr_buf *buf = xdr->buf;
 	struct kvec *iov = buf->tail;
 	buf->pages = pages;
+	buf->destructor = destroy;
 	buf->page_base = base;
 	buf->page_len = len;
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 72abb73..aa31294 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -397,7 +397,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		err = sock->ops->sendpage_destructor(sock, *ppage, xdr->destructor, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread
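The lifetime rule this patch implements, namely that the O_DIRECT write completes only when the network stack drops its last reference to the pages, reduces to a small reference-counting model. The sketch below is userspace C with illustrative names (direct_req stands in for nfs_direct_req, write_page_destroy for nfs_write_page_destroy); it is not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct skb_frag_destructor. */
struct frag_destructor {
	int ref;
	void (*destroy)(void *data);
	void *data;
};

/* Stand-in for nfs_direct_req. */
struct direct_req {
	int io_count;      /* outstanding WRITE chunks, models dreq->io_count */
	int completed;     /* set once the pages are released by the stack */
};

static void destructor_ref(struct frag_destructor *d)
{
	d->ref++;          /* the stack takes a ref per fragment attached */
}

static void destructor_unref(struct frag_destructor *d)
{
	if (d && --d->ref == 0)
		d->destroy(d->data);
}

/* Fires only at the final unref, i.e. once no skb (including any
 * retransmit clone) still references the pages. */
static void write_page_destroy(void *data)
{
	struct direct_req *dreq = data;
	if (--dreq->io_count == 0)      /* models put_dreq() */
		dreq->completed = 1;    /* models nfs_direct_write_complete() */
}
```

The point the patch makes is visible in this model: the RPC release path dropping its reference is no longer sufficient to complete the write while an skb still holds one.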

* [PATCH 10/10] nfs: debugging for nfs destructor
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (8 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack Ian Campbell
@ 2011-07-15 11:07 ` Ian Campbell
  2011-07-15 15:17 ` [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility David Miller
  10 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 11:07 UTC (permalink / raw)
  To: netdev; +Cc: linux-nfs, Ian Campbell

---
 fs/nfs/direct.c            |   57 +++++++++++++++++++++++++++++++++++++++++++-
 fs/nfs/nfs3xdr.c           |    2 +
 include/linux/mm.h         |    4 +++
 include/linux/page-flags.h |    2 +
 mm/swap.c                  |    6 ++++
 net/core/skbuff.c          |   14 ++++++++++-
 net/sunrpc/svcsock.c       |    7 +++++
 net/sunrpc/xdr.c           |   12 ++++++++-
 net/sunrpc/xprtsock.c      |    8 ++++++
 9 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 4735fd9..9512f75 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -144,8 +144,12 @@ static void nfs_direct_dirty_pages(struct page **pages, unsigned int pgbase, siz
 static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
 {
 	unsigned int i;
-	for (i = 0; i < npages; i++)
+	for (i = 0; i < npages; i++) {
+		if (0) printk(KERN_CRIT "%s releasing %p (clearing debug)\n",
+		       __func__, pages[i]);
+		ClearPageDebug(pages[i]);
 		page_cache_release(pages[i]);
+	}
 }
 
 static inline struct nfs_direct_req *nfs_direct_req_alloc(void)
@@ -216,6 +220,8 @@ out:
  */
 static void nfs_direct_complete(struct nfs_direct_req *dreq)
 {
+	if (0) printk(KERN_CRIT "%s for dreq:%p count:%d\n",
+	       __func__, dreq, atomic_read(&dreq->io_count));
 	if (dreq->iocb) {
 		long res = (long) dreq->error;
 		if (!res)
@@ -521,6 +527,8 @@ static void nfs_direct_write_reschedule(struct nfs_direct_req *dreq)
 				(unsigned long long)data->args.offset);
 	}
 
+	if (0) printk(KERN_CRIT "%s for w:%p dreq:%p count:%d\n", __func__,
+	       data, dreq, atomic_read(&dreq->io_count));
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq, inode);
 }
@@ -549,6 +557,8 @@ static void nfs_direct_commit_release(void *calldata)
 	}
 
 	dprintk("NFS: %5u commit returned %d\n", data->task.tk_pid, status);
+	if (0) printk(KERN_CRIT "%s for w:%p dreq:%p count:%d\n", __func__,
+	       data, dreq, atomic_read(&dreq->io_count));
 	nfs_direct_write_complete(dreq, data->inode);
 	nfs_commit_free(data);
 }
@@ -609,20 +619,25 @@ static void nfs_direct_write_complete(struct nfs_direct_req *dreq, struct inode
 {
 	int flags = dreq->flags;
 
+	if (0) printk(KERN_CRIT "%s for dreq:%p\n", __func__, dreq);
 	dreq->flags = 0;
 	switch (flags) {
 		case NFS_ODIRECT_DO_COMMIT:
+			if (0) printk(KERN_CRIT "%s: DO_COMMIT\n", __func__);
 			nfs_direct_commit_schedule(dreq);
 			break;
 		case NFS_ODIRECT_RESCHED_WRITES:
+			if (0) printk(KERN_CRIT "%s: RESCHED\n", __func__);
 			nfs_direct_write_reschedule(dreq);
 			break;
 		default:
+			if (0) printk(KERN_CRIT "%s: DONE\n", __func__);
 			if (dreq->commit_data != NULL)
 				nfs_commit_free(dreq->commit_data);
 			nfs_direct_free_writedata(dreq);
 			nfs_zap_mapping(inode, inode->i_mapping);
 			nfs_direct_complete(dreq);
+			//set_all_pages_debug(data->pagevec, data->npages);
 	}
 }
 
@@ -661,6 +676,7 @@ static void nfs_direct_write_release(void *calldata)
 {
 	struct nfs_write_data *data = calldata;
 	struct nfs_direct_req *dreq = (struct nfs_direct_req *) data->req;
+
 	int status = data->task.tk_status;
 
 	spin_lock(&dreq->lock);
@@ -691,6 +707,8 @@ static void nfs_direct_write_release(void *calldata)
 out_unlock:
 	spin_unlock(&dreq->lock);
 
+	if (0) printk(KERN_CRIT "%s for w:%p dreq:%p count:%d\n", __func__,
+	       data, dreq, atomic_read(&dreq->io_count));
 	skb_frag_destructor_unref(data->args.pages_destructor);
 }
 
@@ -706,11 +724,29 @@ static int nfs_write_page_destroy(void *calldata)
 {
 	struct nfs_write_data *data = calldata;
 	struct nfs_direct_req *dreq = (struct nfs_direct_req *) data->req;
+	//int i;
+	if (0) printk(KERN_CRIT "%s for w:%p dreq:%p count:%d\n", __func__,
+	       data, dreq,
+	       atomic_read(&dreq->io_count));
+	//for (i = 0; i < data->npages; i++)
+	//	put_page(data->pagevec[i]);
+
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq, data->inode);
 	return 0;
 }
 
+static void set_all_pages_debug(struct page **pages, int npages)
+{
+	int i;
+	return;
+	for (i = 0; i < npages; i++) {
+		if (0) printk(KERN_CRIT "Marking page %p as debug current count:%d\n",
+		       pages[i],page_count(pages[i]));
+		SetPageDebug(pages[i]);
+	}
+}
+
 /*
  * For each wsize'd chunk of the user's buffer, dispatch an NFS WRITE
  * operation.  If nfs_writedata_alloc() or get_user_pages() fails,
@@ -754,6 +790,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 		if (unlikely(!data))
 			break;
 
+		if (0) printk(KERN_CRIT "%s: getting user pages\n", __func__);
 		down_read(&current->mm->mmap_sem);
 		result = get_user_pages(current, current->mm, user_addr,
 					data->npages, 0, 0, data->pagevec, NULL);
@@ -773,6 +810,8 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 			data->npages = result;
 		}
 
+		set_all_pages_debug(data->pagevec, data->npages);
+
 		get_dreq(dreq);
 
 		list_move_tail(&data->pages, &dreq->rewrite_list);
@@ -790,7 +829,11 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 		data->args.offset = pos;
 		data->args.pgbase = pgbase;
 		data->args.pages = data->pagevec;
+#if 1
 		data->args.pages_destructor = &data->pagevec_destructor;
+#else
+		data->args.pages_destructor = NULL;
+#endif
 		data->args.count = bytes;
 		data->args.stable = sync;
 		data->res.fattr = &data->fattr;
@@ -804,6 +847,16 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 		msg.rpc_resp = &data->res;
 		NFS_PROTO(inode)->write_setup(data, &msg);
 
+		if (0) printk(KERN_CRIT "%s scheduling w:%p dreq:%p count:%d. args %p pages:%d dest:%p %pf\n",
+		       __func__, data, dreq,
+		       atomic_read(&dreq->io_count),
+		       &data->args, data->npages,
+		       data->args.pages_destructor,
+		       data->args.pages_destructor->destroy);
+		if (0) printk(KERN_CRIT "%s page[0] is %p count:%d\n",
+		       __func__, data->pagevec[0],
+		       page_count(data->pagevec[0]));
+
 		task = rpc_run_task(&task_setup_data);
 		if (IS_ERR(task))
 			break;
@@ -866,6 +919,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		return result < 0 ? result : -EIO;
 	}
 
+	if (0) printk(KERN_CRIT "%s for dreq:%p count:%d\n", __func__,
+	       dreq, atomic_read(&dreq->io_count));
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq, dreq->inode);
 	return 0;
diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index f7a83a1..2137550 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -994,6 +994,8 @@ static void encode_write3args(struct xdr_stream *xdr,
 	*p++ = cpu_to_be32(args->count);
 	*p++ = cpu_to_be32(args->stable);
 	*p = cpu_to_be32(args->count);
+	if (0) printk(KERN_CRIT "%s xdr %p args %p destructor %pF\n",
+	       __func__, xdr, args, args->pages_destructor->destroy);
 	xdr_write_pages(xdr, args->pages, args->pages_destructor,
 			args->pgbase, args->count);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 550ec8f..f992dfa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -362,6 +362,10 @@ static inline int page_count(struct page *page)
 
 static inline void get_page(struct page *page)
 {
+	if (PageDebug(page)) printk(KERN_CRIT "%s(%p) from %pF count %d\n",
+				    __func__, page,
+				    __builtin_return_address(0),
+				    page_count(page));
 	/*
 	 * Getting a normal page or the head of a compound page
 	 * requires to already have an elevated page->_count. Only if
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7d632cc..8434345 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,6 +107,7 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+	PG_debug,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -209,6 +210,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
 PAGEFLAG(SavePinned, savepinned);			/* Xen */
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
+PAGEFLAG(Debug, debug)
 
 __PAGEFLAG(SlobFree, slob_free)
 
diff --git a/mm/swap.c b/mm/swap.c
index 3a442f1..4450105 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -61,6 +61,8 @@ static void __page_cache_release(struct page *page)
 
 static void __put_single_page(struct page *page)
 {
+	if (PageDebug(page)) printk(KERN_CRIT "%s(%p) from %pF\n", __func__, page, __builtin_return_address(0));
+	ClearPageDebug(page);
 	__page_cache_release(page);
 	free_hot_cold_page(page, 0);
 }
@@ -153,6 +155,10 @@ static void put_compound_page(struct page *page)
 
 void put_page(struct page *page)
 {
+	if (PageDebug(page)) printk(KERN_CRIT "%s(%p) from %pF count %d\n",
+				    __func__, page,
+				    __builtin_return_address(0),
+				    page_count(page));
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bdc6f6e..c3fdfb7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,6 +296,9 @@ void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
 {
 	BUG_ON(destroy == NULL);
 	atomic_inc(&destroy->ref);
+	if (0) printk(KERN_CRIT "%s from %pF: %d\n",
+	       __func__, __builtin_return_address(0),
+	       atomic_read(&destroy->ref));
 }
 EXPORT_SYMBOL(skb_frag_destructor_ref);
 
@@ -304,8 +307,17 @@ void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
 	if (destroy == NULL)
 		return;
 
-	if (atomic_dec_and_test(&destroy->ref))
+	if (atomic_dec_and_test(&destroy->ref)) {
+		if (0) printk(KERN_CRIT "%s from %pF: calling destructor %p %pf(%p)\n",
+		       __func__, __builtin_return_address(0),
+		       destroy, destroy->destroy, destroy->data);
 		destroy->destroy(destroy->data);
+	} else {
+		if (0) printk(KERN_CRIT "%s from %pF: destructor %p %pf(%p) not called %d remaining\n",
+		       __func__, __builtin_return_address(0),
+		       destroy, destroy->destroy, destroy->data,
+		       atomic_read(&destroy->ref));
+	}
 }
 EXPORT_SYMBOL(skb_frag_destructor_unref);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 40c2420..64ce9907 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -176,6 +176,10 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	int		slen;
 	int		len = 0;
 
+	if (xdr->destructor)
+		if (0) printk(KERN_CRIT "%s sending xdr %p with destructor %pF\n",
+		       __func__, xdr, xdr->destructor->destroy);
+
 	slen = xdr->len;
 
 	/* send head */
@@ -194,6 +198,9 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
+		if (xdr->destructor)
+			if (0) printk(KERN_CRIT "%s sending xdr %p page %p with destructor %pF\n",
+			       __func__, xdr, *ppage, xdr->destructor);
 		result = kernel_sendpage(sock, *ppage, xdr->destructor, base, size, flags);
 		if (result > 0)
 			len += result;
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 9c7dded..fcaea6a 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -135,6 +135,10 @@ xdr_encode_pages(struct xdr_buf *xdr, struct page **pages, unsigned int base,
 	struct kvec *tail = xdr->tail;
 	u32 *p;
 
+	if (xdr->destructor)
+		if (0) printk(KERN_CRIT "%s xdr %p with destructor %p\n",
+		       __func__, xdr, xdr->destructor);
+
 	xdr->pages = pages;
 	xdr->page_base = base;
 	xdr->page_len = len;
@@ -165,6 +169,11 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
 	char *buf = (char *)head->iov_base;
 	unsigned int buflen = head->iov_len;
 
+
+	if (xdr->destructor)
+		if (0) printk(KERN_CRIT "%s xdr %p with destructor %p\n",
+		       __func__, xdr, xdr->destructor);
+
 	head->iov_len  = offset;
 
 	xdr->pages = pages;
@@ -535,7 +544,8 @@ void xdr_write_pages(struct xdr_stream *xdr, struct page **pages,
 	buf->destructor = destroy;
 	buf->page_base = base;
 	buf->page_len = len;
-
+	if (destroy)
+		if (0) printk(KERN_CRIT "%s xdr %p %p destructor %p\n", __func__, xdr, buf, destroy);
 	iov->iov_base = (char *)xdr->p;
 	iov->iov_len  = 0;
 	xdr->iov = iov;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aa31294..336d787 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -386,6 +386,10 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 	unsigned int remainder;
 	int err, sent = 0;
 
+	if (xdr->destructor)
+		if (0) printk(KERN_CRIT "%s sending xdr %p with destructor %pF\n",
+		       __func__, xdr, xdr->destructor->destroy);
+
 	remainder = xdr->page_len - base;
 	base += xdr->page_base;
 	ppage = xdr->pages + (base >> PAGE_SHIFT);
@@ -425,6 +429,10 @@ static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen,
 	unsigned int remainder = xdr->len - base;
 	int err, sent = 0;
 
+	if (xdr->destructor)
+		if (0) printk(KERN_CRIT "%s sending xdr %p with destructor %pF\n",
+		       __func__, xdr, xdr->destructor->destroy);
+
 	if (unlikely(!sock))
 		return -ENOTSOCK;
 
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack.
  2011-07-15 11:07 ` [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack Ian Campbell
@ 2011-07-15 14:01   ` Trond Myklebust
  2011-07-15 15:21     ` Ian Campbell
  2011-07-21 13:18     ` Ian Campbell
  0 siblings, 2 replies; 17+ messages in thread
From: Trond Myklebust @ 2011-07-15 14:01 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, linux-nfs

On Fri, 2011-07-15 at 12:07 +0100, Ian Campbell wrote: 
> This prevents an issue where an ACK is delayed, a retransmit is queued (either
> at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> wire. If this happens then the write() system call and the userspace process
> can continue potentially modifying the data before the retransmission occurs.
> 
> NB: this only covers the O_DIRECT write() case. I expect other cases to need
> handling as well.

That is why this belongs entirely in the RPC layer, and really should
not touch the NFS layer.
If you move your callback to the RPC layer and have it notify the
rpc_task when the pages have been sent, then it should be possible to
achieve the same thing.

IOW: Add an extra state machine step after call_decode() which checks if
all the page data has been transmitted and if not, puts the rpc_task on
a wait queue, and has it wait for the fragment destructor callback
before calling rpc_exit_task().

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 17+ messages in thread
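The extra state-machine step suggested above can be sketched as a toy model in userspace C. The state names loosely echo the RPC client's call_*() steps, but everything here is hypothetical: a real implementation would park the task on an rpc_wait_queue and be woken by the fragment destructor, rather than poke a flag directly:

```c
#include <assert.h>

enum rpc_step { CALL_DECODE, CALL_WAIT_PAGES, RPC_SLEEPING, RPC_EXIT };

struct rpc_task_model {
	enum rpc_step step;
	int pages_in_flight;   /* nonzero while skbs still reference the pages */
};

/* One turn of the (toy) RPC state machine. */
static void task_step(struct rpc_task_model *t)
{
	switch (t->step) {
	case CALL_DECODE:
		/* Reply decoded; insert the extra step before exiting. */
		t->step = CALL_WAIT_PAGES;
		break;
	case CALL_WAIT_PAGES:
		if (t->pages_in_flight)
			t->step = RPC_SLEEPING;  /* park until pages are free */
		else
			t->step = RPC_EXIT;      /* models rpc_exit_task() */
		break;
	default:
		break;
	}
}

/* Fragment-destructor callback: the stack released the last page ref. */
static void pages_released(struct rpc_task_model *t)
{
	t->pages_in_flight = 0;
	if (t->step == RPC_SLEEPING) {
		t->step = CALL_WAIT_PAGES;   /* wake and re-run the step */
		task_step(t);
	}
}
```

Keeping this entirely in the RPC layer, as suggested, means the NFS code never needs to know whether the transport copied the pages or still holds them.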

* Re: [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility
  2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
                   ` (9 preceding siblings ...)
  2011-07-15 11:07 ` [PATCH 10/10] nfs: debugging for nfs destructor Ian Campbell
@ 2011-07-15 15:17 ` David Miller
  2011-07-15 15:36   ` Ian Campbell
  10 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2011-07-15 15:17 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: netdev, linux-nfs

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Fri, 15 Jul 2011 12:06:46 +0100

> What is the general feeling regarding this approach?

Not bad, I like that only the users of destructors pay the price of
the extra atomics.

Like you say in patch #8, I wouldn't bother adding a whole new
->sendpage_destructor() OP, just add the new argument to the existing
method.

* Re: [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack.
  2011-07-15 14:01   ` Trond Myklebust
@ 2011-07-15 15:21     ` Ian Campbell
  2011-07-21 13:18     ` Ian Campbell
  1 sibling, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 15:21 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: netdev, linux-nfs

On Fri, 2011-07-15 at 15:01 +0100, Trond Myklebust wrote:
> On Fri, 2011-07-15 at 12:07 +0100, Ian Campbell wrote: 
> > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > wire. If this happens then the write() system call and the userspace process
> > can continue potentially modifying the data before the retransmission occurs.
> > 
> > NB: this only covers the O_DIRECT write() case. I expect other cases to need
> > handling as well.
> 
> That is why this belongs entirely in the RPC layer, and really should
> not touch the NFS layer.
> If you move your callback to the RPC layer and have it notify the
> rpc_task when the pages have been sent, then it should be possible to
> achieve the same thing.
> 
> IOW: Add an extra state machine step after call_decode() which checks if
> all the page data has been transmitted and if not, puts the rpc_task on
> a wait queue, and has it wait for the fragment destructor callback
> before calling rpc_exit_task().

Makes sense, I'll do that.

Thanks,
Ian.



* Re: [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility
  2011-07-15 15:17 ` [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility David Miller
@ 2011-07-15 15:36   ` Ian Campbell
  0 siblings, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-15 15:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-nfs

On Fri, 2011-07-15 at 16:17 +0100, David Miller wrote:
> From: Ian Campbell <Ian.Campbell@citrix.com>
> Date: Fri, 15 Jul 2011 12:06:46 +0100
> 
> > What is the general feeling regarding this approach?
> 
> Not bad,

Thanks, I'll continue in this direction then.

>  I like that only the users of destructors pay the price of
> the extra atomics.

Yes, I very much wanted to avoid hitting everyone with extra overhead.

> Like you say in patch #8, I wouldn't bother adding a whole new
> ->sendpage_destructor() OP, just add the new argument to the existing
> method.

Will do.

Ian.


* Re: [PATCH 03/10] net: add APIs for manipulating skb page fragments.
  2011-07-15 11:07 ` [PATCH 03/10] net: add APIs for manipulating skb page fragments Ian Campbell
@ 2011-07-15 22:34   ` Michał Mirosław
  0 siblings, 0 replies; 17+ messages in thread
From: Michał Mirosław @ 2011-07-15 22:34 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, linux-nfs

2011/7/15 Ian Campbell <ian.campbell@citrix.com>:
[...]
> +static inline dma_addr_t skb_frag_pci_map(struct pci_dev *hwdev,
> +                                         const skb_frag_t *frag,
> +                                         unsigned long offset,
> +                                         size_t size,
> +                                         int direction)
> +
> +{
> +       return pci_map_page(hwdev, __skb_frag_page(frag),
> +                           frag->page_offset + offset, size, direction);
> +}

You can get rid of this one, as drivers should be converted to the
generic DMA API instead. The change is only passing &pci_dev->dev
instead of the pci_dev.

Best Regards,
Michał Mirosław

* Re: [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack.
  2011-07-15 14:01   ` Trond Myklebust
  2011-07-15 15:21     ` Ian Campbell
@ 2011-07-21 13:18     ` Ian Campbell
  1 sibling, 0 replies; 17+ messages in thread
From: Ian Campbell @ 2011-07-21 13:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: netdev, linux-nfs

On Fri, 2011-07-15 at 15:01 +0100, Trond Myklebust wrote:
> On Fri, 2011-07-15 at 12:07 +0100, Ian Campbell wrote: 
> > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > wire. If this happens then the write() system call and the userspace process
> > can continue potentially modifying the data before the retransmission occurs.
> > 
> > NB: this only covers the O_DIRECT write() case. I expect other cases to need
> > handling as well.
> 
> That is why this belongs entirely in the RPC layer, and really should
> not touch the NFS layer.
> If you move your callback to the RPC layer and have it notify the
> rpc_task when the pages have been sent, then it should be possible to
> achieve the same thing.
> 
> IOW: Add an extra state machine step after call_decode() which checks if
> all the page data has been transmitted and if not, puts the rpc_task on
> a wait queue, and has it wait for the fragment destructor callback
> before calling rpc_exit_task().
> 
> Cheers

Is this the sort of thing? I wasn't sure where best to put the
destructor data structure to get the right lifecycle and ended up
putting it in the struct rpc_rqst and initialising it at
xprt_request_init time.

I changed every place that currently transitions to rpc_exit_task to
instead transition to a new "call_complete" state, blocking on the
pending wait queue. The SKB fragment destructor wakes that queue and
call_complete then transitions to rpc_exit_task.

Several of the locations already block on that wait queue, so I simply
removed the wake-up in those cases (it now happens in the SKB frag
destructor). Since we drop the initial refcount (via unref) at these
points, in the common case we are woken from the pending wait queue
before we even sleep on it.

Thanks,
Ian.

From 49d7d53d065bf0963fd4bb70405f4f1972f618c4 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@citrix.com>
Date: Mon, 11 Jul 2011 14:43:24 +0100
Subject: [PATCH] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   28 +++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    2 +-
 net/sunrpc/xprt.c           |   13 +++++++++++++
 net/sunrpc/xprtsock.c       |    2 +-
 6 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index a20970e..172f81e 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 81cce3b..0de6bc3 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -91,7 +91,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 8c91415..0c85acb 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1114,6 +1115,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1277,6 +1280,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1310,8 +1314,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1384,7 +1388,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1424,7 +1429,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_NFS_V4_1 */
 
@@ -1591,12 +1595,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1611,6 +1617,18 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	struct rpc_rqst	*req = task->tk_rqstp;
+	dprintk("RPC: %5u call_complete result %d\n", task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a80b1d3..40c2420 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -194,7 +194,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ce5eb68..62f52a3 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1017,6 +1017,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(void *calldata)
+{
+	struct rpc_task *task = calldata;
+	struct rpc_rqst	*req = task->tk_rqstp;
+
+	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1028,6 +1038,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
+	req->destructor.data = task;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d027621..ca1643b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -397,7 +397,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5




end of thread, other threads:[~2011-07-21 13:18 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-15 11:06 [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility Ian Campbell
2011-07-15 11:07 ` [PATCH 01/10] mm: Make some struct page's const Ian Campbell
2011-07-15 11:07 ` [PATCH 02/10] mm: use const struct page for r/o page-flag accessor methods Ian Campbell
2011-07-15 11:07 ` [PATCH 03/10] net: add APIs for manipulating skb page fragments Ian Campbell
2011-07-15 22:34   ` Michał Mirosław
2011-07-15 11:07 ` [PATCH 04/10] net: convert core to skb paged frag APIs Ian Campbell
2011-07-15 11:07 ` [PATCH 05/10] net: convert protocols to SKB " Ian Campbell
2011-07-15 11:07 ` [PATCH 06/10] net: convert drivers to paged frag API Ian Campbell
2011-07-15 11:07 ` [PATCH 07/10] net: add support for per-paged-fragment destructors Ian Campbell
2011-07-15 11:07 ` [PATCH 08/10] net: add paged frag destructor support to kernel_sendpage Ian Campbell
2011-07-15 11:07 ` [PATCH 09/10] nfs: use sk fragment destructors to delay I/O completion until page is released by network stack Ian Campbell
2011-07-15 14:01   ` Trond Myklebust
2011-07-15 15:21     ` Ian Campbell
2011-07-21 13:18     ` Ian Campbell
2011-07-15 11:07 ` [PATCH 10/10] nfs: debugging for nfs destructor Ian Campbell
2011-07-15 15:17 ` [PATCH/RFC 0/10] enable SKB paged fragment lifetime visibility David Miller
2011-07-15 15:36   ` Ian Campbell
