netdev.vger.kernel.org archive mirror
* Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter
@ 2011-10-27 19:53 Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 1/5] net: add SKB_FCLONE_SCRATCH API Neil Horman
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Neil Horman @ 2011-10-27 19:53 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, David S. Miller


I had this idea a while ago while I was looking at the receive path for
multicast frames.  The top of the mcast receive path (in
__udp4_lib_mcast_deliver) has a loop in which we traverse a hash list
linearly, looking for sockets that are listening to a given multicast group.
For each matching socket we clone the skb to enqueue it to the corresponding
socket (the loop is sketched after the two problems below).  This creates two
problems:

1) Application-driven jitter in the receive path
   As you add processes that listen to the same multicast group, you increase
the number of iterations you have to perform in this loop, which can lead to
increases in the amount of time you spend processing each frame in softirq
context, especially if you are memory constrained and the skb_clone operation
has to call all the way back into the buddy allocator for more RAM.  This can
lead to needlessly dropped frames as rx latency increases in the stack.

2) Increased memory usage
   As you increase the number of listeners to a multicast group, you directly
increase the number of times you clone an skb, putting increased memory
pressure on the system.
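
For reference, the path in question boils down to something like the
following (a heavily abridged sketch of __udp4_lib_mcast_deliver() and
flush_stack(), not the actual source; the *_matching_sock() helpers are
illustrative stand-ins, and the bounded-stack flushing, socket refcounting
and error handling are omitted):

	unsigned int count = 0, i;
	struct sock *stack[256 / sizeof(struct sock *)];
	struct sock *sk;

	spin_lock(&hslot->lock);
	for (sk = first_matching_sock(hslot, skb); sk;
	     sk = next_matching_sock(sk, skb))
		stack[count++] = sk;	/* every listener on this group */
	spin_unlock(&hslot->lock);

	for (i = 0; i < count; i++) {
		/* one clone per listener; the last listener gets the original */
		struct sk_buff *skb1 = (i == count - 1) ? skb :
					skb_clone(skb, GFP_ATOMIC);

		if (skb1)
			udp_queue_rcv_skb(stack[i], skb1);
	}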

While neither of these problems is a huge concern, I thought it would be nice
if we could mitigate the effects of increased application instances on
performance in this area.  As such I came up with this patch set.  I created a
new skb fclone type called FCLONE_SCRATCH.  When available, it commandeers the
internally fragmented space of an skb data buffer and uses that to allocate
additional skbs during the clone operation.  Since the skb->data area is
allocated with a kmalloc operation (and is therefore nominally a power of 2 in
size), and network interfaces tend to have an MTU of around 1500 bytes, we can
typically reclaim several hundred bytes of space at the end of an skb (more if
the incoming packet is not a full MTU in size).  This space, being exclusively
accessible to the softirq doing the reclaim, can be accessed quickly without
the need for additional locking, potentially providing lower per-frame jitter
in napi context during a receive operation, as well as some memory savings.
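
To make the numbers concrete, the reclaimable room is roughly what the
sketch below computes (a simplified version of the size check performed by
skb_make_fclone_scratch() in patch 1; fscratch_room() is just an
illustrative name, and struct skb_scr_control is the small header the patch
places in front of the scratch skbs):

static unsigned int fscratch_room(const struct sk_buff *skb)
{
	/* what kmalloc actually handed us for skb->head */
	size_t total = ksize(skb->head);
	/* what the packet data plus skb_shared_info really consume */
	size_t used = (skb_end_pointer(skb) - skb->head) +
		      sizeof(struct skb_shared_info);

	if (used + sizeof(struct skb_scr_control) >= total)
		return 0;

	/* number of extra sk_buffs that fit in the leftover slack */
	return (total - used - sizeof(struct skb_scr_control)) /
		sizeof(struct sk_buff);
}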

I'm still collecting stats on its performance, but I thought I would post now to
get some early reviews and feedback on it.

Thanks & Regards
Neil

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
 


* [RFC PATCH 1/5] net: add SKB_FCLONE_SCRATCH API
  2011-10-27 19:53 Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Neil Horman
@ 2011-10-27 19:53 ` Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 2/5] net: add FCLONE_SCRATCH use to ipv4 udp path Neil Horman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2011-10-27 19:53 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, Neil Horman

From: Neil Horman <nhorman@updev.think-freely.org>

The FCLONE API for skb allocation is nice in that it allows for the
pre-allocation of skbs when you know you will need additional clones.  A nice
addition to this API would be the ability to quickly allocate extra skbs when
needed without having to call into the slab allocator.  This API provides that
ability.  By using the internally fragmented space between the tail and end
pointers, and after the skb_shinfo space, we can opportunistically format this
space for use as extra sk_buff structures.  This allows for fast allocations
in cases where skbs need to be cloned quickly (as in a workload with multiple
multicast listeners), and it does so without allocating further memory from
the system, reducing overall memory demand.

There are rules when using this API, however:

1) skbs that have their data area reserved via this API become fixed, i.e.
they can no longer have skb_pull or pskb_expand_tail called on them.

2) Only a single skb can reserve the space.  The API assumes that the skb
that reserves the space is the owner, and only that skb's owning context will
allocate out of the shared area.
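
A minimal usage sketch, assuming the caller holds a freshly received,
unshared skb (nr_listeners and deliver_one() are hypothetical stand-ins for
the caller's own fan-out logic, not part of this patch):

	unsigned int avail, i;

	/* Carve any reclaimable slack in skb's data buffer into scratch skbs */
	avail = skb_make_fclone_scratch(skb);	/* returns 0 if nothing fit */

	for (i = 0; i < nr_listeners; i++) {
		/*
		 * skb_clone() consumes the scratch skbs first and falls back
		 * to kmem_cache_alloc() once the 'avail' entries run out, so
		 * the caller does not have to special-case either outcome.
		 */
		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);

		if (nskb)
			deliver_one(nskb, i);
	}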

Tested successfully by myself
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 include/linux/skbuff.h |   51 +++++++++++++++++++++++++++++-
 net/core/skbuff.c      |   82 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 129 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6a6b352..e04fa48 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -258,7 +258,7 @@ struct skb_shared_info {
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
-/* We divide dataref into two halves.  The higher 16 bits hold references
+/* We divide dataref two halves.  The higher 15 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
  * the header in skb->hdr_len.
@@ -277,6 +277,7 @@ enum {
 	SKB_FCLONE_UNAVAILABLE,
 	SKB_FCLONE_ORIG,
 	SKB_FCLONE_CLONE,
+	SKB_FCLONE_SCRATCH,
 };
 
 enum {
@@ -2512,5 +2513,53 @@ static inline bool skb_is_recycleable(const struct sk_buff *skb, int skb_size)
 
 	return true;
 }
+
+struct skb_scr_control {
+	struct sk_buff_head scr_skbs;
+	struct sk_buff *owner;
+};
+
+/*
+ * gets our control data for the scratch area
+ */
+static inline struct skb_scr_control*
+	skb_get_scratch_control(struct sk_buff *skb)
+{
+	struct skb_scr_control *sctl;
+	sctl = (struct skb_scr_control *)((void *)skb_shinfo(skb) +
+					  sizeof(struct skb_shared_info));
+	return sctl;
+}
+
+/*
+ * Converts the scratch space of an skbs data area to a list of
+ * skbuffs.  Returns the number of additional skbs allocated
+ */
+extern unsigned int skb_make_fclone_scratch(struct sk_buff *skb);
+
+/*
+ * Allocates an skb out of our scratch space
+ */
+static inline struct sk_buff *alloc_fscratch_skb(struct sk_buff *skb)
+{
+	struct skb_scr_control *sctl = skb_get_scratch_control(skb);
+	struct sk_buff *sskb;
+
+	BUG_ON(skb->fclone != SKB_FCLONE_SCRATCH);
+	BUG_ON(!sctl);
+	BUG_ON(sctl->owner != skb);
+	if (skb_queue_empty(&sctl->scr_skbs))
+		return NULL;
+
+	sskb = __skb_dequeue(&sctl->scr_skbs);
+
+	/*
+	 * Mark us as a scratch skb, so we get properly kfree-ed
+	 */
+	sskb->fclone = SKB_FCLONE_SCRATCH;
+
+	return sskb;
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ca4db40..6fdf1a7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -367,6 +367,7 @@ static void kfree_skbmem(struct sk_buff *skb)
 	atomic_t *fclone_ref;
 
 	switch (skb->fclone) {
+	case SKB_FCLONE_SCRATCH:
 	case SKB_FCLONE_UNAVAILABLE:
 		kmem_cache_free(skbuff_head_cache, skb);
 		break;
@@ -438,8 +439,16 @@ static void skb_release_all(struct sk_buff *skb)
 
 void __kfree_skb(struct sk_buff *skb)
 {
+	struct skb_scr_control *sctl;
+	bool need_free = (skb->fclone == SKB_FCLONE_SCRATCH);
+	if (need_free) {
+		sctl = skb_get_scratch_control(skb);
+		need_free = (sctl->owner == skb);
+	}
+
 	skb_release_all(skb);
-	kfree_skbmem(skb);
+	if (need_free)
+		kfree_skbmem(skb);
 }
 EXPORT_SYMBOL(__kfree_skb);
 
@@ -701,6 +710,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	struct sk_buff *n;
+	atomic_t *fclone_ref;
 
 	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
 		if (skb_copy_ubufs(skb, gfp_mask))
@@ -710,10 +720,15 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n = skb + 1;
 	if (skb->fclone == SKB_FCLONE_ORIG &&
 	    n->fclone == SKB_FCLONE_UNAVAILABLE) {
-		atomic_t *fclone_ref = (atomic_t *) (n + 1);
+		fclone_ref = (atomic_t *) (n + 1);
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
-	} else {
+	} else if (skb->fclone == SKB_FCLONE_SCRATCH)
+		n = alloc_fscratch_skb(skb);
+	else
+		n = NULL;
+
+	if (!n) {
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -3205,3 +3220,64 @@ void __skb_warn_lro_forwarding(const struct sk_buff *skb)
 			   " while LRO is enabled\n", skb->dev->name);
 }
 EXPORT_SYMBOL(__skb_warn_lro_forwarding);
+
+unsigned int skb_make_fclone_scratch(struct sk_buff *skb)
+{
+	size_t bufsz, totsz, scrsz, tmpsz;
+	struct skb_scr_control *sctl;
+	struct sk_buff *scr_skb;
+	struct skb_shared_info *old_info;
+	bool format_tail = false;
+
+	if (skb_shared(skb))
+		return 0;
+
+	/*
+	 * Cant do scratch space on fcloned skbs
+	 */
+	if (skb->fclone)
+		return 0;
+
+	if ((skb->end - skb->tail) > sizeof(struct skb_shared_info)) {
+		old_info = skb_shinfo(skb);
+		skb->end = skb->tail;
+		memcpy(skb_shinfo(skb), old_info,
+		       sizeof(struct skb_shared_info));
+	}
+
+	/*
+	 * skb is ours, lets see how big the data area is
+	 */
+	totsz = ksize(skb->head);
+
+	/*
+	 * This is the used size of our data buffer
+	 */
+	bufsz = (skb_end_pointer(skb) - skb->head) +
+		 sizeof(struct skb_shared_info);
+
+	if ((bufsz + sizeof(struct skb_scr_control)) >= totsz)
+		return 0;
+
+	/*
+	 * And this is the leftover area, minus sizeof(int) to store the number
+	 * of scratch skbs we have
+	 */
+	scrsz = totsz - (bufsz + sizeof(struct skb_scr_control));
+
+	sctl = skb_get_scratch_control(skb);
+
+	sctl->owner = skb;
+	scr_skb = (struct sk_buff *)(sctl + 1);
+	__skb_queue_head_init(&sctl->scr_skbs);
+	for (tmpsz = sizeof(struct sk_buff); tmpsz < scrsz;
+	    tmpsz += sizeof(struct sk_buff)) {
+		__skb_queue_tail(&sctl->scr_skbs, scr_skb);
+		scr_skb++;
+	}
+
+	skb->fclone = SKB_FCLONE_SCRATCH;
+
+	return skb_queue_len(&sctl->scr_skbs);
+
+}
-- 
1.7.6.4


* [RFC PATCH 2/5] net: add FCLONE_SCRATCH use to ipv4 udp path
  2011-10-27 19:53 Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 1/5] net: add SKB_FCLONE_SCRATCH API Neil Horman
@ 2011-10-27 19:53 ` Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 3/5] net: Add & modify tracepoints to skb FCLONE_SCRATCH paths Neil Horman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2011-10-27 19:53 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, Neil Horman, David S. Miller

From: Neil Horman <nhorman@updev.think-freely.org>

UDP v4 multicast delivery in a multiple-group-listener workload is a heavy
user of skb_clone, which can lead to performance problems if memory is
constrained and frame reception rates are high.  Once in the UDP path,
however, we can reserve the unused storage at the tail of the datagram's skb
and allocate skbs from it quickly.  Modify the UDP multicast receive path to
reserve that storage using the FCLONE_SCRATCH API, making opportunistic use of
this space.

Tested successfully by myself
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 net/ipv4/udp.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ebaa96b..e1c0b1e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1489,7 +1489,6 @@ drop:
 	return -1;
 }
 
-
 static void flush_stack(struct sock **stack, unsigned int count,
 			struct sk_buff *skb, unsigned int final)
 {
@@ -1532,6 +1531,8 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	int dif;
 	unsigned int i, count = 0;
 
+	skb_make_fclone_scratch(skb);
+
 	spin_lock(&hslot->lock);
 	sk = sk_nulls_head(&hslot->head);
 	dif = skb->dev->ifindex;
-- 
1.7.6.4


* [RFC PATCH 3/5] net: Add & modify tracepoints to skb FCLONE_SCRATCH paths
  2011-10-27 19:53 Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 1/5] net: add SKB_FCLONE_SCRATCH API Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 2/5] net: add FCLONE_SCRATCH use to ipv4 udp path Neil Horman
@ 2011-10-27 19:53 ` Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 4/5] perf: add perf script to monitor efficiency increase in FCLONE_SCRATCH api Neil Horman
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2011-10-27 19:53 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, Neil Horman, David S. Miller

From: Neil Horman <nhorman@updev.think-freely.org>

Since FCLONE_SCRATCH fcloning is opportunistic, it would be nice to have some
feedback on how efficiently the path is working.  These tracepoints provide
the infrastructure needed to create perf scripts that let us measure things
like the number of skbs that are converted for FCLONE_SCRATCH use, the number
of scratch skbs that actually get allocated, and the average per-packet time
spent in the napi softirq context.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 include/linux/skbuff.h      |   21 +--------------------
 include/trace/events/napi.h |   41 +++++++++++++++++++++++++++++++++++++++++
 include/trace/events/skb.h  |   41 +++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c              |    2 ++
 net/core/skbuff.c           |   37 +++++++++++++++++++++++++++++++++----
 5 files changed, 118 insertions(+), 24 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e04fa48..77e3605 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2540,26 +2540,7 @@ extern unsigned int skb_make_fclone_scratch(struct sk_buff *skb);
 /*
  * Allocates an skb out of our scratch space
  */
-static inline struct sk_buff *alloc_fscratch_skb(struct sk_buff *skb)
-{
-	struct skb_scr_control *sctl = skb_get_scratch_control(skb);
-	struct sk_buff *sskb;
-
-	BUG_ON(skb->fclone != SKB_FCLONE_SCRATCH);
-	BUG_ON(!sctl);
-	BUG_ON(sctl->owner != skb);
-	if (skb_queue_empty(&sctl->scr_skbs))
-		return NULL;
-
-	sskb = __skb_dequeue(&sctl->scr_skbs);
-
-	/*
-	 * Mark us as a scratch skb, so we get properly kfree-ed
-	 */
-	sskb->fclone = SKB_FCLONE_SCRATCH;
-
-	return sskb;
-}
+extern struct sk_buff *alloc_fscratch_skb(struct sk_buff *skb);
 
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 8fe1e93..5af5675 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -7,6 +7,7 @@
 #include <linux/netdevice.h>
 #include <linux/tracepoint.h>
 #include <linux/ftrace.h>
+#include <linux/kernel_stat.h>
 
 #define NO_DEV "(no_device)"
 
@@ -30,6 +31,46 @@ TRACE_EVENT(napi_poll,
 		__entry->napi, __get_str(dev_name))
 );
 
+TRACE_EVENT(napi_schedule,
+
+	TP_PROTO(struct napi_struct *napi),
+
+	TP_ARGS(napi),
+
+	TP_STRUCT__entry(
+		__field(	struct napi_struct *,	napi)
+		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
+	),
+
+	TP_fast_assign(
+		__entry->napi = napi;
+		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
+	),
+
+	TP_printk("napi schedule on napi struct %p for device %s",
+		__entry->napi, __get_str(dev_name))
+);
+
+TRACE_EVENT(napi_complete,
+
+	TP_PROTO(struct napi_struct *napi),
+
+	TP_ARGS(napi),
+
+	TP_STRUCT__entry(
+		__field(	struct napi_struct *,	napi)
+		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
+	),
+
+	TP_fast_assign(
+		__entry->napi = napi;
+		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
+	),
+
+	TP_printk("napi complete on napi struct %p for device %s",
+		__entry->napi, __get_str(dev_name))
+);
+
 #undef NO_DEV
 
 #endif /* _TRACE_NAPI_H_ */
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 0c68ae22..3b83438 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -69,6 +69,47 @@ TRACE_EVENT(skb_copy_datagram_iovec,
 	TP_printk("skbaddr=%p len=%d", __entry->skbaddr, __entry->len)
 );
 
+TRACE_EVENT(skb_make_fclone_scratch,
+
+	TP_PROTO(struct sk_buff *skb, int count),
+
+	TP_ARGS(skb, count),
+
+	TP_STRUCT__entry(
+		__field(	const struct sk_buff *,	skb)
+		__string(	name, skb->dev ? skb->dev->name : "unknown")
+		__field(	int,			fccount)
+	),
+
+	TP_fast_assign(
+		__entry->skb = skb;
+		__assign_str(name, skb->dev ? skb->dev->name : "unknown");
+		__entry->fccount = count;
+	),
+
+	TP_printk("skb= %p, dev=%s, count=%d", __entry->skb,
+		  __get_str(name), __entry->fccount)
+);
+
+TRACE_EVENT(alloc_fscratch_skb,
+
+	TP_PROTO(const struct sk_buff *parent, const struct sk_buff *child),
+
+	TP_ARGS(parent, child),
+
+	TP_STRUCT__entry(
+		__field(	const struct sk_buff *, parent)
+		__field(	const struct sk_buff *, child)
+	),
+
+	TP_fast_assign(
+		__entry->parent = parent;
+		__entry->child = child;
+	),
+
+	TP_printk("parent=%p, child=%p", __entry->parent, __entry->child)
+);
+
 #endif /* _TRACE_SKB_H */
 
 /* This part must be outside protection */
diff --git a/net/core/dev.c b/net/core/dev.c
index b7ba81a..1af4c02 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2579,6 +2579,7 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
+	trace_napi_schedule(napi);
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
@@ -3829,6 +3830,7 @@ void __napi_complete(struct napi_struct *n)
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
 	BUG_ON(n->gro_list);
 
+	trace_napi_complete(n);
 	list_del(&n->poll_list);
 	smp_mb__before_clear_bit();
 	clear_bit(NAPI_STATE_SCHED, &n->state);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6fdf1a7..0347446 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3221,6 +3221,30 @@ void __skb_warn_lro_forwarding(const struct sk_buff *skb)
 }
 EXPORT_SYMBOL(__skb_warn_lro_forwarding);
 
+/*
+ * Allocates an skb out of our scratch space
+ */
+struct sk_buff *alloc_fscratch_skb(struct sk_buff *skb)
+{
+	struct skb_scr_control *sctl = skb_get_scratch_control(skb);
+	struct sk_buff *sskb = NULL;
+
+	BUG_ON(skb->fclone != SKB_FCLONE_SCRATCH);
+	BUG_ON(!sctl);
+	BUG_ON(sctl->owner != skb);
+
+	sskb = __skb_dequeue(&sctl->scr_skbs);
+
+	/*
+	 * Mark us as a scratch skb, so we get properly kfree-ed
+	 */
+	sskb->fclone = SKB_FCLONE_SCRATCH;
+
+	trace_alloc_fscratch_skb(skb, sskb);
+	return sskb;
+}
+EXPORT_SYMBOL(alloc_fscratch_skb);
+
 unsigned int skb_make_fclone_scratch(struct sk_buff *skb)
 {
 	size_t bufsz, totsz, scrsz, tmpsz;
@@ -3228,15 +3252,16 @@ unsigned int skb_make_fclone_scratch(struct sk_buff *skb)
 	struct sk_buff *scr_skb;
 	struct skb_shared_info *old_info;
 	bool format_tail = false;
+	int fclone_count = 0;
 
 	if (skb_shared(skb))
-		return 0;
+		goto out;
 
 	/*
 	 * Cant do scratch space on fcloned skbs
 	 */
 	if (skb->fclone)
-		return 0;
+		goto out;
 
 	if ((skb->end - skb->tail) > sizeof(struct skb_shared_info)) {
 		old_info = skb_shinfo(skb);
@@ -3257,7 +3282,7 @@ unsigned int skb_make_fclone_scratch(struct sk_buff *skb)
 		 sizeof(struct skb_shared_info);
 
 	if ((bufsz + sizeof(struct skb_scr_control)) >= totsz)
-		return 0;
+		goto out;
 
 	/*
 	 * And this is the leftover area, minus sizeof(int) to store the number
@@ -3278,6 +3303,10 @@ unsigned int skb_make_fclone_scratch(struct sk_buff *skb)
 
 	skb->fclone = SKB_FCLONE_SCRATCH;
 
-	return skb_queue_len(&sctl->scr_skbs);
+	fclone_count = skb_queue_len(&sctl->scr_skbs);
+out:
+	trace_skb_make_fclone_scratch(skb, fclone_count);
+	return fclone_count;
 
 }
+EXPORT_SYMBOL(skb_make_fclone_scratch);
-- 
1.7.6.4


* [RFC PATCH 4/5] perf: add perf script to monitor efficiency increase in FCLONE_SCRATCH api
  2011-10-27 19:53 Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Neil Horman
                   ` (2 preceding siblings ...)
  2011-10-27 19:53 ` [RFC PATCH 3/5] net: Add & modify tracepoints to skb FCLONE_SCRATCH paths Neil Horman
@ 2011-10-27 19:53 ` Neil Horman
  2011-10-27 19:53 ` [RFC PATCH 5/5] net: add FCLONE_SCRATCH use to ipv6 udp path Neil Horman
  2011-10-27 22:55 ` Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Eric Dumazet
  5 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2011-10-27 19:53 UTC (permalink / raw)
  To: netdev; +Cc: root, Neil Horman, David S. Miller

From: root <root@updev.think-freely.org>

Since the FCLONE_SCRATCH mechanism is opportunistic, gathering internally
fragmented memory when available, it's beneficial to know how efficiently it's
working, so that tuning can be done to optimize it.  This patch adds a perf
script to report on the data collected via the previously added tracepoints.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 .../scripts/python/bin/net-fscratch-stats-record   |    4 +
 .../scripts/python/bin/net-fscratch-stats-report   |    4 +
 tools/perf/scripts/python/net-fscratch.py          |  198 ++++++++++++++++++++
 3 files changed, 206 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/scripts/python/bin/net-fscratch-stats-record
 create mode 100644 tools/perf/scripts/python/bin/net-fscratch-stats-report
 create mode 100644 tools/perf/scripts/python/net-fscratch.py

diff --git a/tools/perf/scripts/python/bin/net-fscratch-stats-record b/tools/perf/scripts/python/bin/net-fscratch-stats-record
new file mode 100644
index 0000000..7aae593
--- /dev/null
+++ b/tools/perf/scripts/python/bin/net-fscratch-stats-record
@@ -0,0 +1,4 @@
+#!/bin/bash
+perf record 	-e skb:skb_make_fclone_scratch -e skb:alloc_fscratch_skb \
+		-e napi:napi_schedule -e napi:napi_complete \
+		-e napi:napi_poll -e net:netif_receive_skb $@
diff --git a/tools/perf/scripts/python/bin/net-fscratch-stats-report b/tools/perf/scripts/python/bin/net-fscratch-stats-report
new file mode 100644
index 0000000..85bb867
--- /dev/null
+++ b/tools/perf/scripts/python/bin/net-fscratch-stats-report
@@ -0,0 +1,4 @@
+#!/bin/bash
+# description: display FCLONE_SCRATCH usage statistics and napi processing time
+
+perf script -s "$PERF_EXEC_PATH"/scripts/python/net-fscratch.py $@
diff --git a/tools/perf/scripts/python/net-fscratch.py b/tools/perf/scripts/python/net-fscratch.py
new file mode 100644
index 0000000..f9ae5c9
--- /dev/null
+++ b/tools/perf/scripts/python/net-fscratch.py
@@ -0,0 +1,198 @@
+# Report FCLONE_SCRATCH efficiency statistics gathered from the skb and
+# napi tracepoints recorded by net-fscratch-stats-record.
+#
+# For each skb marked as having scratch space, track how many scratch
+# children were offered and how many were actually used.  Also report the
+# average napi receive latency per frame on each cpu, derived from the
+# netif_receive_skb events seen within each napi schedule/complete cycle.
+# The script takes no command line options.
+
+import os
+import sys
+
+sys.path.append(os.environ['PERF_EXEC_PATH'] + \
+	'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
+
+from perf_trace_context import *
+from Core import *
+from Util import *
+
+parent_skbs = {}
+
+total_parents=0
+total_children_avail=0
+total_children_used=0
+total_orphans=0
+
+IDX_FC_COUNT=0
+IDX_FC_KIDS=1
+
+STATE_START_TIMING=0
+STATE_TIMING=1
+STATE_COLLECT_TIMING=2
+STATE_RESET=3
+
+cpu_cycle_stats = {}
+cpu_total_stats = {}
+
+class cpuCycleStats():
+	def __init__(self):
+		self.start_rx_time = 0
+		self.end_rx_time = 0
+		self.state = STATE_RESET
+		self.total_rx_frames = 0
+
+class cpuTotalStats():
+	def __init__(self):
+		self.total_frames = 0
+		self.total_napi_time = 0
+		self.napi_sc_cycles = 0
+
+def gather_fclone_use_stats(stat):
+	global total_parents
+	global total_children_avail
+	global total_children_used
+
+	total_parents = total_parents+1
+	total_children_avail = total_children_avail + stat[IDX_FC_COUNT]
+	total_children_used = total_children_used + stat[IDX_FC_KIDS]
+
+# called from perf when it finds a corresponding event
+def skb__skb_make_fclone_scratch(event_name, context, common_cpu,
+ common_secs, common_nsecs, common_pid, common_comm,
+ skb, name, fccount):
+	global parent_skbs
+
+	if (skb in parent_skbs.keys()):
+		gather_fclone_use_stats(parent_skbs[skb])
+		parent_skbs[skb] = None
+
+	parent_skbs[skb] = [fccount, 0]
+
+def skb__alloc_fscratch_skb(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	parent, child):
+	global parent_skbs
+	global total_orphans
+
+	if (child == 0):
+		#We didn't have an fscratch_child to allocate
+		return
+
+	try:
+		parent_skbs[parent][IDX_FC_KIDS] += 1
+	except:
+		total_orphans += 1
+
+def napi__napi_schedule(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	napi, dev_name):
+	global cpu_cycle_stats
+
+	if (common_cpu in cpu_cycle_stats.keys()):
+		return;
+
+	cpu_cycle_stats[common_cpu] = cpuCycleStats()
+	cpu_cycle_stats[common_cpu].state = STATE_START_TIMING
+	return
+
+def napi__napi_complete(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	napi, dev_name):
+	global cpu_cycle_stats
+	global cpu_total_stats
+
+
+	if (common_cpu not in cpu_cycle_stats.keys()):
+		return
+
+	if (cpu_cycle_stats[common_cpu].state == STATE_TIMING):
+		cpu_cycle_stats[common_cpu].state = STATE_COLLECT_TIMING
+
+
+def napi__napi_poll(event_name, context, common_cpu,
+        common_secs, common_nsecs, common_pid, common_comm,
+        napi, dev_name):
+	global cpu_cycle_stats
+	global cpu_total_stats
+
+	if (common_cpu not in cpu_cycle_stats.keys()):
+		return
+
+
+	if (common_cpu not in cpu_total_stats.keys()):
+		cpu_total_stats[common_cpu] = cpuTotalStats()
+
+	state = cpu_cycle_stats[common_cpu].state
+
+	if (state == STATE_COLLECT_TIMING):
+		cpu_total_stats[common_cpu].napi_sc_cycles += 1
+
+	if (cpu_cycle_stats[common_cpu].end_rx_time == cpu_cycle_stats[common_cpu].start_rx_time):
+		cpu_cycle_stats[common_cpu].end_rx_time = common_nsecs
+
+	if ((state == STATE_COLLECT_TIMING) or (state == STATE_TIMING)):
+		if (cpu_cycle_stats[common_cpu].end_rx_time > cpu_cycle_stats[common_cpu].start_rx_time):
+			napi_time = cpu_cycle_stats[common_cpu].end_rx_time - cpu_cycle_stats[common_cpu].start_rx_time
+		else:
+			napi_time = cpu_cycle_stats[common_cpu].start_rx_time - cpu_cycle_stats[common_cpu].end_rx_time
+
+		if (napi_time == 0):
+			cpu_cycle_stats[common_cpu].total_rx_frames = 0
+
+		cpu_total_stats[common_cpu].total_frames += cpu_cycle_stats[common_cpu].total_rx_frames
+		cpu_total_stats[common_cpu].total_napi_time += napi_time
+		cpu_cycle_stats[common_cpu] = cpuCycleStats()
+		cpu_cycle_stats[common_cpu].state = STATE_START_TIMING
+
+
+def net__netif_receive_skb(event_name, context, common_cpu,
+        common_secs, common_nsecs, common_pid, common_comm,
+        skbaddr, len, name):
+	global cpu_cycle_stats
+
+	if (common_cpu not in cpu_cycle_stats.keys()):
+		return
+
+	if (cpu_cycle_stats[common_cpu].state == STATE_START_TIMING):
+		cpu_cycle_stats[common_cpu].state = STATE_TIMING
+		cpu_cycle_stats[common_cpu].start_rx_time = common_nsecs
+
+
+	if (cpu_cycle_stats[common_cpu].state == STATE_TIMING):
+		cpu_cycle_stats[common_cpu].total_rx_frames += 1
+		cpu_cycle_stats[common_cpu].end_rx_time = common_nsecs
+
+
+def trace_end():
+	global parent_skbs
+	global total_parents
+	global total_children_avail
+	global total_children_used
+	global total_orphans
+	global cpu_total_stats
+
+	for i in parent_skbs.keys():
+		gather_fclone_use_stats(parent_skbs[i])
+	try:
+		avg_offer_skb = str(total_children_avail / total_parents)
+		avg_used_skb = str(total_children_used / total_parents)
+	except:
+		avg_offer_skb = str(0)
+		avg_used_skb = str(0)
+
+	print "Performance report:"
+	print "Skbs marked as having scratch space available: " + str(total_parents)
+	print "Total fclone_scratch skb children available: " + str(total_children_avail)
+	print "Total fclone_scratch skb children used: " + str(total_children_used)
+	print "Total orphans: " + str(total_orphans)
+	print "Average number of scratch skbs available: " + avg_offer_skb
+	print "Average number of scratch skbs used: " + avg_used_skb
+	for i in cpu_total_stats.keys():
+		tframe = cpu_total_stats[i].total_frames
+		ttime = cpu_total_stats[i].total_napi_time
+		try:
+			print "CPU " + str(i) + " avg napi latency " + str(ttime/tframe) + " nsec/frame (" + str(ttime) + " " + str(tframe) + ")"
+		except:
+			print "CPU " + str(i) + " avg napi latency 0 usec/frame (" + str(ttime) + " " + str(tframe) + ")"
+		print "CPU " + str(i) + " napi sched/complete cycles: " + str(cpu_total_stats[i].napi_sc_cycles)
-- 
1.7.6.4


* [RFC PATCH 5/5] net: add FCLONE_SCRATCH use to ipv6 udp path
  2011-10-27 19:53 Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Neil Horman
                   ` (3 preceding siblings ...)
  2011-10-27 19:53 ` [RFC PATCH 4/5] perf: add perf script to monitor efficiency increase in FCLONE_SCRATCH api Neil Horman
@ 2011-10-27 19:53 ` Neil Horman
  2011-10-27 22:55 ` Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Eric Dumazet
  5 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2011-10-27 19:53 UTC (permalink / raw)
  To: netdev; +Cc: root, Neil Horman, David S. Miller

From: root <root@amd-dinar-01.lab.bos.redhat.com>

Like the ipv4 path, the ipv6 path can benefit from this change by taking
advantage of the unused space at the tail of an skbuff's data area.  Mark ipv6
udp multicast frames as being eligible for scratch fcloning.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 net/ipv6/udp.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f4ca0a5..dda6661 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -645,6 +645,8 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	int dif;
 	unsigned int i, count = 0;
 
+	skb_make_fclone_scratch(skb);
+
 	spin_lock(&hslot->lock);
 	sk = sk_nulls_head(&hslot->head);
 	dif = inet6_iif(skb);
-- 
1.7.6.4


* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter
  2011-10-27 19:53 Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Neil Horman
                   ` (4 preceding siblings ...)
  2011-10-27 19:53 ` [RFC PATCH 5/5] net: add FCLONE_SCRATCH use to ipv6 udp path Neil Horman
@ 2011-10-27 22:55 ` Eric Dumazet
  2011-10-28  1:37   ` Neil Horman
  5 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2011-10-27 22:55 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller

On Thursday, 27 October 2011 at 15:53 -0400, Neil Horman wrote:
> I had this idea awhile ago while I was looking at the receive path for multicast
> frames.   The top of the mcast recieve path (in __udp4_lib_mcast_deliver, has a
> loop in which we traverse a hash list linearly, looking for sockets that are
> listening to a given multicast group.  For each matching socket we clone the skb
> to enqueue it to the corresponding socket.  This creates two problems:
> 
> 1) Application driven jitter in the receive path
>    As you add processes that listen to the same multcast group, you increase the
> number of iterations you have to preform in this loop, which can lead to
> increases in the amount of time you spend processing each frame in softirq
> context, expecially if you are memory constrained, and the skb_clone operation
> has to call all the way back into the buddy allocator for more ram.  This can
> lead to needlessly dropped frames as rx latency increases in the stack.
> 

Hmm... the time to perform this loop does not depend on memory constraints,
since GFP_ATOMIC allocations are used.  They succeed or fail immediately.

Time is consumed on the copy of the skb head, and on the refcount
increments/decrements on the data refcnt.  Your patch doesn't avoid this.

When the application calls recvmsg() we then perform the two atomic ops on
the skb refcnt and data refcnt and free them, with cache-line false
sharing...

> 2) Increased memory usage
>    As you increase the number of listeners to a multicast group, you directly
> increase the number of times you clone and skb, putting increased memory
> pressure on the system.
> 

One skb_head is about 256 bytes (232 bytes on 64bit arches)

> while neither of these problems is a huge concern, I thought it would be nice if
> we could mitigate the effects of increased application instances on performance
> in this area.  As such I came up with this patch set.  I created a new skb
> fclone type called FCLONE_SCRATCH.  When available, it commandeers the
> internally fragmented space of an skb data buffer and uses that to allocate
> additional skbs during the clone operation. Since the skb->data area is
> allocated with a kmalloc operation (and is therefore nominally a power of 2 in
> size), and nominally network interfaces tend to have an mtu of around 1500
> bytes, we typically can reclaim several hundred bytes of space at the end of an
> skb (more if the incomming packet is not a full MTU in size).  This space, being
> exclusively accessible to the softirq doing the reclaim, can be quickly accesed
> without the need for additional locking, potntially providing lower jitter in
> napi context per frame during a receive operation, as well as some memory
> savings.
> 
> I'm still collecting stats on its performance, but I thought I would post now to
> get some early reviews and feedback on it.
> 

I really doubt you'll find a significant performance increase.

I do believe it's too complex: skb code is already a nightmare if you
ask me.

And your hack/idea won't work very well if you have 8 receivers for each
frame.

What about finding another way to queue one skb to N receive queue(s),
so that several multicast sockets can share the same skb head?

I have always found sk_receive_queue to be very inefficient, since a queue
or dequeue operation must dirty a lot of cache lines.

This forces us to use a spinlock to protect enqueue/dequeue operations,
but also the socket lock (because of the MSG_PEEK stuff and
sk_rmem_alloc / sk_forward_alloc).

sk_receive_queue.lock is the real jitter source.

Ideally, we could have a fast path using a small circular array per socket,
of say 8 or 16 pointers to skbs, or allow the application or sysadmin to
size this array.

A circular buffer can be handled without any lock, using atomic operations
(cmpxchg()) on the load/unload indexes.  The array of pointers is written
only by the softirq handler cpu and read by consumers.

Since this array is small [and of finite size], and the skb is shared, we
don't call skb_set_owner_r() anymore, avoiding expensive atomic ops on
sk->sk_rmem_alloc.

The UDP receive path could become lockless, allowing the softirq handler to
run without being slowed down by concurrent recvmsg() calls.

At recvmsg() time, N-1 threads would only perform the skb->refcnt
decrement, and the last one would free the skb and data as well.
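
Concretely, such a per-socket ring could look something like the sketch
below (purely illustrative; the structure and helpers are not an existing
kernel API, and the memory ordering would need more care than shown):

struct udp_skb_ring {
	struct sk_buff	*slot[16];	/* power-of-two size, per socket */
	unsigned int	head;		/* written only by the softirq producer */
	unsigned int	tail;		/* advanced by consumers via cmpxchg() */
};

/* Producer, softirq context: single writer, so no atomics on 'head' */
static bool skb_ring_push(struct udp_skb_ring *r, struct sk_buff *skb)
{
	unsigned int head = r->head;

	if (head - ACCESS_ONCE(r->tail) >= ARRAY_SIZE(r->slot))
		return false;			/* ring full, caller falls back */
	r->slot[head & (ARRAY_SIZE(r->slot) - 1)] = skb;
	smp_wmb();				/* publish the slot before the index */
	r->head = head + 1;
	return true;
}

/* Consumers, recvmsg(): multiple readers race on 'tail' with cmpxchg() */
static struct sk_buff *skb_ring_pop(struct udp_skb_ring *r)
{
	unsigned int tail;
	struct sk_buff *skb;

	do {
		tail = ACCESS_ONCE(r->tail);
		if (tail == ACCESS_ONCE(r->head))
			return NULL;		/* ring empty */
		smp_rmb();			/* read index before slot contents */
		skb = r->slot[tail & (ARRAY_SIZE(r->slot) - 1)];
	} while (cmpxchg(&r->tail, tail, tail + 1) != tail);

	return skb;
}

The skb itself stays shared by every socket that has a slot pointing at it;
the only per-receiver cost is the skb->users decrement described above.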


* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter
  2011-10-27 22:55 ` Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter Eric Dumazet
@ 2011-10-28  1:37   ` Neil Horman
  2011-10-28  2:37     ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Horman @ 2011-10-28  1:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller

On Fri, Oct 28, 2011 at 12:55:46AM +0200, Eric Dumazet wrote:
> On Thursday, 27 October 2011 at 15:53 -0400, Neil Horman wrote:
> > I had this idea awhile ago while I was looking at the receive path for multicast
> > frames.   The top of the mcast recieve path (in __udp4_lib_mcast_deliver, has a
> > loop in which we traverse a hash list linearly, looking for sockets that are
> > listening to a given multicast group.  For each matching socket we clone the skb
> > to enqueue it to the corresponding socket.  This creates two problems:
> > 
> > 1) Application driven jitter in the receive path
> >    As you add processes that listen to the same multcast group, you increase the
> > number of iterations you have to preform in this loop, which can lead to
> > increases in the amount of time you spend processing each frame in softirq
> > context, expecially if you are memory constrained, and the skb_clone operation
> > has to call all the way back into the buddy allocator for more ram.  This can
> > lead to needlessly dropped frames as rx latency increases in the stack.
> > 
> 
> Hmm... time to perform this loop not depends on memory constraints,
> since GFP_ATOMIC allocations are done. It succeed or not, immediately.
> 
That's not entirely correct.  The time it takes to allocate a new object from
the slab can be very much affected by memory constraints.  Systems under
memory pressure may need to shrink slab caches, allocate from remote nodes,
etc., all of which contributes to additional allocation time.  That time is
multiplied by the number of sockets listening to a particular group.  I'm not
saying it's a huge change, but it's a change that I saw an opportunity to
address.  This change (I hope) tries, when able, to eliminate the jitter from
that path.

> Time is consumed on the copy of the skb head, and refcnt
> increases/decreases on datarefcnt. Your patch doesnt avoid this.
> 
No it doesn't, you're correct.  I was really looking at lower hanging fruit with
this change.

> When application calls recvmsg() we then perform the two atomics on skb
> refcnt and data refcnt and free them, with cache line false sharing...
> 
> > 2) Increased memory usage
> >    As you increase the number of listeners to a multicast group, you directly
> > increase the number of times you clone and skb, putting increased memory
> > pressure on the system.
> > 
> 
> One skb_head is about 256 bytes (232 bytes on 64bit arches)
> 
Yes, that's right.  Depending on the size of the received frame I can allocate
anywhere from 1 to 4 additional skbs using otherwise unused memory at the tail
of the data segment with this patch.

> > while neither of these problems is a huge concern, I thought it would be nice if
> > we could mitigate the effects of increased application instances on performance
> > in this area.  As such I came up with this patch set.  I created a new skb
> > fclone type called FCLONE_SCRATCH.  When available, it commandeers the
> > internally fragmented space of an skb data buffer and uses that to allocate
> > additional skbs during the clone operation. Since the skb->data area is
> > allocated with a kmalloc operation (and is therefore nominally a power of 2 in
> > size), and nominally network interfaces tend to have an mtu of around 1500
> > bytes, we typically can reclaim several hundred bytes of space at the end of an
> > skb (more if the incomming packet is not a full MTU in size).  This space, being
> > exclusively accessible to the softirq doing the reclaim, can be quickly accesed
> > without the need for additional locking, potntially providing lower jitter in
> > napi context per frame during a receive operation, as well as some memory
> > savings.
> > 
> > I'm still collecting stats on its performance, but I thought I would post now to
> > get some early reviews and feedback on it.
> > 
> 
> I really doubt you'll find a significative performance increase.
> 
With the perf script I included in this set, I'm currently reducing the napi
rx path runtime by about 2 microseconds per iteration when multiple processes
are listening to the same group and we can allocate skbs from the data area of
the received skb.  I'm not sure how accurate (or valuable) that is, but that's
what I'm getting.  While I've not measured it, there's also the fact that
kfree_skb doesn't have any work to do for FCLONE_SCRATCH skbs, since they all
get returned when the data area gets kfreed.  Lastly, there's the savings of
256 bytes per listening app that uses an FCLONE_SCRATCH skb, since that memory
is already reserved, and effectively free.  I can't really put a number on how
significant that is, but I think it's valuable.


> I do believe its a too complex : skb code is already a nightmare if you
> ask me.
> 
I'm not sure I agree with that.  It's certainly complex in the sense that it
adds additional state to be checked in the skb_clone and kfree_skb paths, but
beyond that, it doesn't require any semantic changes to any of the other parts
of the skb API(s).  Is there something specific about this change that you
find unacceptably complex, or something I can do to make it simpler?

> And your hack/idea wont work quite well if you have 8 receivers for each
> frame.
> 
It will work fine to whatever degree it's able, and beyond that it will fall
back on the standard skb_clone behavior.  The performance gains will be
reduced, of course, which I think is what you are referring to, but the memory
advantages still exist.  If you have 8 receivers, but a given skb has only
enough free space at the end of the data segment to allocate 4 additional
skbs, we'll allocate those 4, and do a real kmem_cache_alloc of 4 more in
skb_clone.  The last 4 have to use the slab allocation path as we otherwise
would, so we're no worse off than before, but we still get the memory savings
(which would be about 1k per received frame for this mcast group, which I
think is significant).

> What about finding another way to queue one skb to N receive queue(s),
> so that several multicast sockets can share same skb head ?
> 
Yes, I'd experimented with some of this, and had mixed results.  Specifically,
I came up with a way to use the additional space as an array of list heads,
with backpointers to the socket queue each list_head was owned by and the skb
that owned the entry.  It was nice because it let me enqueue the same skb to
hundreds of sockets at the same time.  It was bad, however, because it
completely broke socket accounting (and likely anything else that relied on
any part of the socket state).  Any thoughts on ways I might improve on that?
If I could make some sort of reduced sk_buff so that I didn't have to allocate
all 256 bytes of a full sk_buff, that would be great!

> I always found sk_receive queue being very inefficient, since a queue or
> dequeue must dirty a lot of cache lines.
> 
> This forces us to use a spinlock to protect queue/enqueue operations,
> but also the socket lock (because of the MSG_PEEK stuff and
> sk_rmem_alloc / sk_forward_alloc)
> 
> sk_receive_queue.lock is the real jitter source.
> 
> Ideally, we could have a fast path using a small circular array per
> socket, of say 8 or 16 pointers to skbs, or allow application or
> sysadmin to size this array.
> 
> A circular buffer can be handled without any lock, using atomic
> operations (cmpxchg()) on load/unload indexes. The array of pointers is
> written only by the softirq handler cpu, read by consumers.
> 
> Since this array is small [and finite size], and skb shared, we dont
> call skb_set_owner_r() anymore, avoiding expensive atomic ops on
> sk->sk_rmem_alloc.
> 
> UDP receive path could become lockless, allowing the softirq handler to
> run without being slowed down by concurrent recvmsg()
> 
> At recvmsg() time, N-1 threads would only perform the skb->refcnt
> decrement, and the last one would free the skb and data as well.
> 
This seems like a reasonable idea, and I'm happy to experiment with it, but I
still like the idea of using that often-available space at the end of an skb,
just because I think the memory savings are attractive.  Shall I roll them
into the same patch series, or do them separately?
Neil

> 
> 
> 


* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory usage and napi jitter
  2011-10-28  1:37   ` Neil Horman
@ 2011-10-28  2:37     ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2011-10-28  2:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller

On Thursday, 27 October 2011 at 21:37 -0400, Neil Horman wrote:
> > 
> Yes, I'd experimented with some of this, and had mixed results.  Specifically I
> came up with a way to use the additional space as an array of list heads, with
> backpointers to the socket_queue each list_head was owned by and the skb that
> owned the entry.  It was nice because it let me enqueue the same skb to hundreds
> of sockets at the same time, which was really nice.  It was bad however, because
> it completely broke socket accounting (any likely anything else that required on
> any part of the socket state).  Any thoughts on ways I might improve on that.
> If I could make some sort of reduced sk_buff so that I didn't have to allocate
> all 256 bytes of a full sk_buff, that would be great!

It seems your objectives are contradictory (memory saving and cpu
saving). Most of the time we have to fight false sharing first.

I don't want to try to use the available space from the skb, since it's cache
unfriendly in the case where you have one CPU 100% busy in softirq handling,
dispatching messages to XX other application cpus.

You don't want each application cpu slowing down other cpus by manipulating a
list anchor contained in a shared, highly contended cache line.  We already
have one high-contention point (the skb refcount or data refcount).

Another point of interest is that modern NICs tend to use paged-frag skbs,
with very little space available in the skb head.  The frag part is read-only,
and sometimes has no available space either.

So your idea only works on some (old-generation) NICs and under some narrow
conditions.

I believe it adds too much complex code for this.

I understand you are disappointed, but maybe you should have shared your
ideas before coding them!

