* [PATCH net-next 0/2] pskb_extract() helper function.
@ 2016-04-20 10:17 Sowmini Varadhan
  2016-04-20 10:17 ` [PATCH net-next 1/2] skbuff: Add " Sowmini Varadhan
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Sowmini Varadhan @ 2016-04-20 10:17 UTC (permalink / raw)
  To: netdev, rds-devel, santosh.shilimkar, davem
  Cc: sowmini.varadhan, eric.dumazet, marcelo.leitner

This patchset follows up on the discussion in
 https://www.mail-archive.com/netdev@vger.kernel.org/msg105090.html

For RDS-TCP, we have to deal with the full gamut of
nonlinear sk_buffs, including all the frag_list variants.
Also, the parent skb has to remain unchanged, while the clone
is queued for Rx on the PF_RDS socket. 

Patch 1 of this patchset adds a pskb_extract() function that
does all of this without the redundant memcpys incurred by
pskb_expand_head() and __pskb_pull_tail().
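
For the caller, the change amounts to collapsing the clone/pull/trim
sequence into a single call; roughly (the variable names below are just
placeholders for values the caller already has):

	/* before: clone the whole segment, then pull/trim it down */
	clone = skb_clone(skb, gfp);
	if (!clone || !pskb_pull(clone, off) || pskb_trim(clone, to_copy)) {
		kfree_skb(clone);
		/* handle -ENOMEM */
	}

	/* after: copy only the bytes that survive the pull/trim */
	clone = pskb_extract(skb, off, to_copy, gfp);
	if (!clone) {
		/* handle -ENOMEM */
	}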

Sowmini Varadhan (2):
  Add pskb_extract() helper function
  Call pskb_extract() helper function

 include/linux/skbuff.h |    2 +
 net/core/skbuff.c      |  248 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/rds/tcp_recv.c     |   14 +--
 3 files changed, 253 insertions(+), 11 deletions(-)


* [PATCH net-next 1/2] skbuff: Add pskb_extract() helper function
  2016-04-20 10:17 [PATCH net-next 0/2] pskb_extract() helper function Sowmini Varadhan
@ 2016-04-20 10:17 ` Sowmini Varadhan
  2016-04-22 23:27   ` marcelo.leitner
  2016-04-20 10:17 ` [PATCH net-next 2/2] RDS: TCP: Call " Sowmini Varadhan
  2016-04-22 23:23 ` [PATCH net-next 0/2] " marcelo.leitner
  2 siblings, 1 reply; 6+ messages in thread
From: Sowmini Varadhan @ 2016-04-20 10:17 UTC (permalink / raw)
  To: netdev, rds-devel, santosh.shilimkar, davem
  Cc: sowmini.varadhan, eric.dumazet, marcelo.leitner

A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.

The existing code uses the sequence
	clone = skb_clone(..);
	pskb_pull(clone, off);
	pskb_trim(clone, to_copy);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().

To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/skbuff.h |    2 +
 net/core/skbuff.c      |  248 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 250 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index da0ace3..a1ce639 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2986,6 +2986,8 @@ struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
 int skb_vlan_pop(struct sk_buff *skb);
 int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
+struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
+			     gfp_t gfp);
 
 static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4cc594c..e8b6d20 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4619,3 +4619,251 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 	return NULL;
 }
 EXPORT_SYMBOL(alloc_skb_with_frags);
+
+/* carve out the first off bytes from skb when off < headlen */
+static int pskb_carve_inside_header(struct sk_buff *skb, const u32 off,
+				    const int headlen, gfp_t gfp_mask)
+{
+	int i;
+	int size = skb_end_offset(skb);
+	int new_hlen = headlen - off;
+	u8 *data;
+	int doff = 0;
+
+	size = SKB_DATA_ALIGN(size);
+
+	if (skb_pfmemalloc(skb))
+		gfp_mask |= __GFP_MEMALLOC;
+	data = kmalloc_reserve(size +
+			       SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
+			       gfp_mask, NUMA_NO_NODE, NULL);
+	if (!data)
+		return -ENOMEM;
+
+	size = SKB_WITH_OVERHEAD(ksize(data));
+
+	/* Copy real data, and all frags */
+	skb_copy_from_linear_data_offset(skb, off, data, new_hlen);
+	skb->len -= off;
+
+	memcpy((struct skb_shared_info *)(data + size),
+	       skb_shinfo(skb),
+	       offsetof(struct skb_shared_info,
+			frags[skb_shinfo(skb)->nr_frags]));
+	if (skb_cloned(skb)) {
+		/* drop the old head gracefully */
+		if (skb_orphan_frags(skb, gfp_mask)) {
+			kfree(data);
+			return -ENOMEM;
+		}
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+			skb_frag_ref(skb, i);
+		if (skb_has_frag_list(skb))
+			skb_clone_fraglist(skb);
+		skb_release_data(skb);
+	} else {
+		/* We can reuse the existing refcount - all we did was
+		 * relocate values.
+		 */
+		skb_free_head(skb);
+	}
+
+	doff = (data - skb->head);
+	skb->head = data;
+	skb->data = data;
+	skb->head_frag = 0;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->end = size;
+	doff = 0;
+#else
+	skb->end = skb->head + size;
+#endif
+	skb_set_tail_pointer(skb, skb_headlen(skb));
+	skb_headers_offset_update(skb, 0);
+	skb->cloned = 0;
+	skb->hdr_len = 0;
+	skb->nohdr = 0;
+	atomic_set(&skb_shinfo(skb)->dataref, 1);
+
+	return 0;
+}
+
+static int pskb_carve(struct sk_buff *skb, const u32 off, gfp_t gfp);
+
+/* carve out the first eat bytes from skb's frag_list. May recurse into
+ * pskb_carve()
+ */
+static int pskb_carve_frag_list(struct sk_buff *skb,
+				struct skb_shared_info *shinfo, int eat,
+				gfp_t gfp_mask)
+{
+	struct sk_buff *list = shinfo->frag_list;
+	struct sk_buff *clone = NULL;
+	struct sk_buff *insp = NULL;
+
+	do {
+		if (!list) {
+			pr_err("Not enough bytes to eat. Want %d\n", eat);
+			return -EFAULT;
+		}
+		if (list->len <= eat) {
+			/* Eaten as whole. */
+			eat -= list->len;
+			list = list->next;
+			insp = list;
+		} else {
+			/* Eaten partially. */
+			if (skb_shared(list)) {
+				clone = skb_clone(list, gfp_mask);
+				if (!clone)
+					return -ENOMEM;
+				insp = list->next;
+				list = clone;
+			} else {
+				/* This may be pulled without problems. */
+				insp = list;
+			}
+			if (pskb_carve(list, eat, gfp_mask) < 0) {
+				kfree_skb(clone);
+				return -ENOMEM;
+			}
+			break;
+		}
+	} while (eat);
+
+	/* Free pulled out fragments. */
+	while ((list = shinfo->frag_list) != insp) {
+		shinfo->frag_list = list->next;
+		kfree_skb(list);
+	}
+	/* And insert new clone at head. */
+	if (clone) {
+		clone->next = list;
+		shinfo->frag_list = clone;
+	}
+	return 0;
+}
+
+/* Carve off the first off bytes from skb. The split point (off) lies
+ * in the non-linear part of skb.
+ */
+static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off,
+				       int pos, gfp_t gfp_mask)
+{
+	int i, k = 0;
+	int size = skb_end_offset(skb);
+	u8 *data;
+	const int nfrags = skb_shinfo(skb)->nr_frags;
+	struct skb_shared_info *shinfo;
+	int doff = 0;
+
+	size = SKB_DATA_ALIGN(size);
+
+	if (skb_pfmemalloc(skb))
+		gfp_mask |= __GFP_MEMALLOC;
+	data = kmalloc_reserve(size +
+			       SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
+			       gfp_mask, NUMA_NO_NODE, NULL);
+	if (!data)
+		return -ENOMEM;
+
+	size = SKB_WITH_OVERHEAD(ksize(data));
+
+	memcpy((struct skb_shared_info *)(data + size),
+	       skb_shinfo(skb), offsetof(struct skb_shared_info,
+					 frags[skb_shinfo(skb)->nr_frags]));
+	if (skb_orphan_frags(skb, gfp_mask)) {
+		kfree(data);
+		return -ENOMEM;
+	}
+	shinfo = (struct skb_shared_info *)(data + size);
+	for (i = 0; i < nfrags; i++) {
+		int fsize = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+
+		if (pos + fsize > off) {
+			shinfo->frags[k] = skb_shinfo(skb)->frags[i];
+
+			if (pos < off) {
+				/* Split frag.
+				 * There are two options here:
+				 * 1. Move the whole frag to the second
+				 *    part, if possible (e.g. this is
+				 *    mandatory for TUX, where splitting
+				 *    is expensive).
+				 * 2. Split accurately, which we do here.
+				 */
+				shinfo->frags[0].page_offset += off - pos;
+				skb_frag_size_sub(&shinfo->frags[0], off - pos);
+			}
+			skb_frag_ref(skb, i);
+			k++;
+		}
+		pos += fsize;
+	}
+	shinfo->nr_frags = k;
+	if (skb_has_frag_list(skb))
+		skb_clone_fraglist(skb);
+
+	if (k == 0) {
+		/* split line is in frag list */
+		pskb_carve_frag_list(skb, shinfo, off - pos, gfp_mask);
+	}
+	skb_release_data(skb);
+
+	doff = (data - skb->head);
+	skb->head = data;
+	skb->head_frag = 0;
+	skb->data = data;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->end = size;
+	doff = 0;
+#else
+	skb->end = skb->head + size;
+#endif
+	skb_reset_tail_pointer(skb);
+	skb_headers_offset_update(skb, 0);
+	skb->cloned   = 0;
+	skb->hdr_len  = 0;
+	skb->nohdr    = 0;
+	skb->len -= off;
+	skb->data_len = skb->len;
+	atomic_set(&skb_shinfo(skb)->dataref, 1);
+	return 0;
+}
+
+/* remove len bytes from the beginning of the skb */
+static int pskb_carve(struct sk_buff *skb, const u32 len, gfp_t gfp)
+{
+	int headlen = skb_headlen(skb);
+
+	if (len < headlen)
+		return pskb_carve_inside_header(skb, len, headlen, gfp);
+	else
+		return pskb_carve_inside_nonlinear(skb, len, headlen, gfp);
+}
+
+/* Extract to_copy bytes starting at off from skb, and return this in
+ * a new skb
+ */
+struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
+			     int to_copy, gfp_t gfp)
+{
+	struct sk_buff  *clone = skb_clone(skb, gfp);
+
+	if (!clone)
+		return NULL;
+
+	if (pskb_carve(clone, off, gfp) < 0) {
+		pr_warn("pskb_carve failed\n");
+		kfree_skb(clone);
+		return NULL;
+	}
+
+	if (pskb_trim(clone, to_copy)) {
+		pr_warn("pskb_trim failed\n");
+		kfree_skb(clone);
+		return NULL;
+	}
+	return clone;
+}
+EXPORT_SYMBOL(pskb_extract);
-- 
1.7.1


* [PATCH net-next 2/2] RDS: TCP:  Call pskb_extract() helper function
  2016-04-20 10:17 [PATCH net-next 0/2] pskb_extract() helper function Sowmini Varadhan
  2016-04-20 10:17 ` [PATCH net-next 1/2] skbuff: Add " Sowmini Varadhan
@ 2016-04-20 10:17 ` Sowmini Varadhan
  2016-04-22 23:23 ` [PATCH net-next 0/2] " marcelo.leitner
  2 siblings, 0 replies; 6+ messages in thread
From: Sowmini Varadhan @ 2016-04-20 10:17 UTC (permalink / raw)
  To: netdev, rds-devel, santosh.shilimkar, davem
  Cc: sowmini.varadhan, eric.dumazet, marcelo.leitner

rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvement when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/tcp_recv.c |   14 +++-----------
 1 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 27a9921..d75d8b5 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -207,22 +207,14 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
 		}
 
 		if (left && tc->t_tinc_data_rem) {
-			clone = skb_clone(skb, arg->gfp);
+			to_copy = min(tc->t_tinc_data_rem, left);
+
+			clone = pskb_extract(skb, offset, to_copy, arg->gfp);
 			if (!clone) {
 				desc->error = -ENOMEM;
 				goto out;
 			}
 
-			to_copy = min(tc->t_tinc_data_rem, left);
-			if (!pskb_pull(clone, offset) ||
-			    pskb_trim(clone, to_copy)) {
-				pr_warn("rds_tcp_data_recv: pull/trim failed "
-					"left %zu data_rem %zu skb_len %d\n",
-					left, tc->t_tinc_data_rem, skb->len);
-				kfree_skb(clone);
-				desc->error = -ENOMEM;
-				goto out;
-			}
 			skb_queue_tail(&tinc->ti_skb_list, clone);
 
 			rdsdebug("skb %p data %p len %d off %u to_copy %zu -> "
-- 
1.7.1


* Re: [PATCH net-next 0/2] pskb_extract() helper function.
  2016-04-20 10:17 [PATCH net-next 0/2] pskb_extract() helper function Sowmini Varadhan
  2016-04-20 10:17 ` [PATCH net-next 1/2] skbuff: Add " Sowmini Varadhan
  2016-04-20 10:17 ` [PATCH net-next 2/2] RDS: TCP: Call " Sowmini Varadhan
@ 2016-04-22 23:23 ` marcelo.leitner
  2016-04-22 23:41   ` Sowmini Varadhan
  2 siblings, 1 reply; 6+ messages in thread
From: marcelo.leitner @ 2016-04-22 23:23 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, rds-devel, santosh.shilimkar, davem, eric.dumazet

On Wed, Apr 20, 2016 at 03:17:40AM -0700, Sowmini Varadhan wrote:
> This patchset follows up on the discussion in
>  https://www.mail-archive.com/netdev@vger.kernel.org/msg105090.html
> 
> For RDS-TCP, we have to deal with the full gamut of
> nonlinear sk_buffs, including all the frag_list variants.
> Also, the parent skb has to remain unchanged, while the clone
> is queued for Rx on the PF_RDS socket. 
> 
> Patch 1 of this patchset adds a pskb_extract() function that 
> does all this without the redundant memcpy's in pskb_expand_head() 
> and __pskb_pull_tail().

I applied this patchset and updated SCTP to also use it for data chunks.

My test results were very similar to what I had without it, varying
between better and worse but tending worse. Thing is, SCTP always works
on linearized skbs, as it can't crawl through fragments, so those
clone/trim operations are just offset adjustments on data that stays
shared. With pskb_extract() it implies a new memory allocation and a
copy even in this best case, so for SCTP, for now, it's actually a
drawback I'm afraid.
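
Roughly, on an already-linear skb the old clone/pull/trim sequence
boils down to a few pointer/length adjustments on data that both skbs
keep sharing; a simplified sketch of what the inlines end up doing,
not the exact kernel code:

	clone = skb_clone(skb, gfp);	/* new sk_buff, data buffer stays
					 * shared, dataref is bumped
					 */
	/* pskb_pull(clone, off), with off <= skb_headlen(clone): */
	clone->len  -= off;
	clone->data += off;
	/* pskb_trim(clone, to_copy), with clone->data_len == 0: */
	clone->len = to_copy;
	skb_set_tail_pointer(clone, to_copy);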

  Marcelo


* Re: [PATCH net-next 1/2] skbuff: Add pskb_extract() helper function
  2016-04-20 10:17 ` [PATCH net-next 1/2] skbuff: Add " Sowmini Varadhan
@ 2016-04-22 23:27   ` marcelo.leitner
  0 siblings, 0 replies; 6+ messages in thread
From: marcelo.leitner @ 2016-04-22 23:27 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, rds-devel, santosh.shilimkar, davem, eric.dumazet

On Wed, Apr 20, 2016 at 03:17:41AM -0700, Sowmini Varadhan wrote:
...
> +/* Extract to_copy bytes starting at off from skb, and return this in
> + * a new skb
> + */
> +struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
> +			     int to_copy, gfp_t gfp)
> +{
> +	struct sk_buff  *clone = skb_clone(skb, gfp);
> +
> +	if (!clone)
> +		return NULL;
> +
> +	if (pskb_carve(clone, off, gfp) < 0) {
> +		pr_warn("pskb_carve failed\n");

You most likely don't want these pr_warn calls.

> +		kfree_skb(clone);
> +		return NULL;
> +	}
> +
> +	if (pskb_trim(clone, to_copy)) {
> +		pr_warn("pskb_trim failed\n");
> +		kfree_skb(clone);
> +		return NULL;
> +	}
> +	return clone;

Then these two blocks can be just:

	if (pskb_carve(clone, off, gfp) < 0 ||
	    pskb_trim(clone, to_copy)) {
		kfree_skb(clone);
		clone = NULL;
	}
	return clone;

  Marcelo

> +}
> +EXPORT_SYMBOL(pskb_extract);
> -- 
> 1.7.1
> 


* Re: [PATCH net-next 0/2] pskb_extract() helper function.
  2016-04-22 23:23 ` [PATCH net-next 0/2] " marcelo.leitner
@ 2016-04-22 23:41   ` Sowmini Varadhan
  0 siblings, 0 replies; 6+ messages in thread
From: Sowmini Varadhan @ 2016-04-22 23:41 UTC (permalink / raw)
  To: marcelo.leitner; +Cc: netdev, rds-devel, santosh.shilimkar, davem, eric.dumazet

On (04/22/16 20:23), marcelo.leitner@gmail.com wrote:
> My tests results were very similar to what I had without it. Varying to
> better or worse, tending worse.  Thing is, SCTP always works on
> linearized skbs as it can't crawl on fragments, so those clone/trim

Sorry to hear that. For RDS-TCP, the rx side does show a noticeable
benefit with rds-stress. To an extent, this is also impacted by the
packet size and the type of test (for our DB workloads we use
request-response tests, and the packet size is typically 8K requests
with 256 byte responses), so I guess ymmv.

--Sowmini


