From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: netdev@vger.kernel.org, Christoph Lameter <cl@linux.com>,
	tom@herbertland.com, Alexander Duyck <alexander.duyck@gmail.com>,
	ogerlitz@mellanox.com, gerlitz.or@gmail.com, brouer@redhat.com
Subject: Re: [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context
Date: Wed, 3 Feb 2016 11:38:40 +0100	[thread overview]
Message-ID: <20160203113840.215521ed@redhat.com> (raw)
In-Reply-To: <20160203005249.GA3103@ast-mbp.thefacebook.com>

On Tue, 2 Feb 2016 16:52:50 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Feb 02, 2016 at 10:12:01PM +0100, Jesper Dangaard Brouer wrote:
> > Think twice before applying
> >  - This patch can potentially introduce added latency in some workloads
> > 
> > This patch introduces bulk alloc of SKBs and allows reuse of SKBs
> > freed in the same softirq cycle.  SKBs are normally freed during TX
> > completion, but most high-speed drivers also clean up the TX ring during
> > the NAPI RX poll cycle.  Thus, when using napi_consume_skb/__kfree_skb_defer,
> > SKBs will be available in the napi_alloc_cache->skb_cache.
> > 
> > If no SKBs are available for reuse, then bulk alloc only 8 SKBs, to limit
> > the potential overshoot of unused SKBs that need to be freed when the NAPI
> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
> > 
> > Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
> > (GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
> >  Allocator SLUB:
> >   Single CPU/flow numbers: before: 2064446 pps -> after: 2083031 pps
> >   Improvement: +18585 pps, -4.3 nanosec, +0.9%
> >  Allocator SLAB:
> >   Single CPU/flow numbers: before: 2035949 pps -> after: 2033567 pps
> >   Regression: -2382 pps, +0.57 nanosec, -0.1 %
> > 
> > Even though benchmarking does show an improvement for SLUB (+0.9%), I'm
> > not convinced bulk alloc will be a win in all situations:
> >  * I see stalls on walking the SLUB freelist (normally hidden by prefetch)
> >  * In case the RX queue is not full, we alloc and free more SKBs than needed
> > 
> > More testing is needed with more real life benchmarks.
> > 
> > Joint work with Alexander Duyck.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>  
> ...
> > -	skb = __build_skb(data, len);
> > -	if (unlikely(!skb)) {
> > +#define BULK_ALLOC_SIZE 8
> > +	if (!nc->skb_count) {
> > +		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
> > +						      gfp_mask, BULK_ALLOC_SIZE,
> > +						      nc->skb_cache);
> > +	}
> > +	if (likely(nc->skb_count)) {
> > +		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> > +	} else {
> > +		/* alloc bulk failed */
> >  		skb_free_frag(data);
> >  		return NULL;
> >  	}
> >  
> > +	len -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +
> > +	memset(skb, 0, offsetof(struct sk_buff, tail));
> > +	skb->truesize = SKB_TRUESIZE(len);
> > +	atomic_set(&skb->users, 1);
> > +	skb->head = data;
> > +	skb->data = data;
> > +	skb_reset_tail_pointer(skb);
> > +	skb->end = skb->tail + len;
> > +	skb->mac_header = (typeof(skb->mac_header))~0U;
> > +	skb->transport_header = (typeof(skb->transport_header))~0U;
> > +
> > +	/* make sure we initialize shinfo sequentially */
> > +	shinfo = skb_shinfo(skb);
> > +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > +	atomic_set(&shinfo->dataref, 1);
> > +	kmemcheck_annotate_variable(shinfo->destructor_arg);  
> 
> copy-pasting from __build_skb()...
> Either a new helper is needed, or __build_skb() should be extended
> to take a pre-allocated 'raw_skb' pointer.

I guess I should have created a helper for the basic skb setup.  I just
kept it here, as I was planning to do tricks like removing the memset, as
discussed below.
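To make the cache logic in the patch above concrete, here is a minimal
userspace sketch of the same pattern: pop from a per-cycle cache, refilling
with a small bulk alloc only when empty.  The names (skb_cache, skb_count,
BULK_ALLOC_SIZE) mirror the patch, but the "bulk allocator" here is plain
malloc standing in for kmem_cache_alloc_bulk(), so this is illustrative
only, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NAPI_SKB_CACHE_SIZE 64
#define BULK_ALLOC_SIZE 8

struct napi_alloc_cache {
	unsigned int skb_count;
	void *skb_cache[NAPI_SKB_CACHE_SIZE];
};

/* Stand-in for kmem_cache_alloc_bulk(): fills 'cache' with up to 'nr'
 * objects and returns how many were actually allocated. */
static unsigned int fake_alloc_bulk(size_t size, unsigned int nr, void **cache)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		cache[i] = malloc(size);
		if (!cache[i])
			break;
	}
	return i;
}

/* Pop one object, refilling with a small bulk alloc (8) only when the
 * cache is empty, so a NAPI cycle that ends early does not strand many
 * unused objects that then have to be flushed. */
static void *cache_alloc(struct napi_alloc_cache *nc, size_t size)
{
	if (!nc->skb_count)
		nc->skb_count = fake_alloc_bulk(size, BULK_ALLOC_SIZE,
						nc->skb_cache);
	if (!nc->skb_count)
		return NULL; /* bulk alloc failed */
	return nc->skb_cache[--nc->skb_count];
}
```

The point of the small refill size is visible here: after one allocation
the cache holds only 7 spare objects, not a full ring's worth.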

> Overall I like the first 3 patches. I think they're useful
> on their own.

Great, hope they can go into net-next soon.

> As far as bulk alloc... have you considered splitting
> bulk alloc of skb and init of skb?

I have given the SKB memset a lot of consideration, as it shows up quite
high in profiles.  We can either 1) avoid clearing this much, or 2) speed
up the clearing by clearing more at once (full pages in the allocator).

My idea (2) is about clearing larger memory chunks, as the
memset/rep-stos operation has a fixed startup cost, but doing it inside
the slab allocator's bulk alloc call.  During bulk alloc we can
identify objects that are next to each other and execute one rep-stos
operation over the whole run.  My measurements show we need 3x 256-byte
objects cleared together before it is a win.
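A rough userspace sketch of that run-detection idea, under the assumption
that the bulk allocator hands back an array of object pointers: scan for
objects that directly follow each other in memory and clear each run with
a single memset.  The 3-object threshold comes from the measurements
mentioned above; the function name is made up for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Below 3 adjacent objects, per-object memset was measured to be
 * cheaper than one large rep-stos. */
#define CLEAR_RUN_MIN 3

/* Given 'nr' object pointers of 'size' bytes each (e.g. just returned
 * by a bulk alloc), detect runs of physically adjacent objects and
 * clear each run with one memset, amortizing the rep-stos startup cost. */
static void bulk_clear_adjacent(void **objs, unsigned int nr, size_t size)
{
	unsigned int i = 0;

	while (i < nr) {
		unsigned int run = 1;

		/* extend the run while the next object directly follows */
		while (i + run < nr &&
		       (char *)objs[i + run] == (char *)objs[i] + run * size)
			run++;

		if (run >= CLEAR_RUN_MIN) {
			memset(objs[i], 0, run * size); /* one big clear */
		} else {
			unsigned int j;

			for (j = 0; j < run; j++)
				memset(objs[i + j], 0, size);
		}
		i += run;
	}
}
```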

*BUT* I actually like your idea below, of delaying the init, even
better ... more comments below.

> Like in the above
> +	skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> will give a cold pointer, and the first memset() will miss the cache.
> Either a prefetch is needed, the way slab_alloc_node() does it
> in the line prefetch_freepointer(s, next_object);

I could prefetch the next elem in skb_cache[]; I have been playing with
that.  It didn't have much effect, as the bulk alloc (at least for
SLUB) has already walked the memory.  It helped a little to prefetch
the 3rd cache line, for the memset speed... (but the effect was so small
that it was difficult to measure with enough confidence).
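For reference, the prefetch variant being discussed might look roughly
like this userspace sketch, analogous to slab_alloc_node()'s
prefetch_freepointer(s, next_object).  __builtin_prefetch is a GCC/Clang
builtin; it is only a hint and may be a no-op on some targets:

```c
#include <stddef.h>
#include <string.h>

/* Pop the top object from a pointer cache, prefetching the one below
 * it so the *next* pop's memset is less likely to miss, then do the
 * clear that would otherwise hit cold memory. */
static void *cache_pop_with_prefetch(void **cache, unsigned int *count,
				     size_t obj_size)
{
	void *obj;

	if (!*count)
		return NULL;
	obj = cache[--*count];
	if (*count)
		__builtin_prefetch(cache[*count - 1]); /* warm the next object */
	memset(obj, 0, obj_size); /* the memset that was missing cache */
	return obj;
}
```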

> or bulk_alloc_skb and bulk_init_skb need to be two loops
> driven by drivers.
> Another idea is we can move skb_init all the way up till
> eth_type_trans() and the driver should prefetch both
> skb->data and skb pointers. Then eth_type_trans_and_skb_init()
> helper will read from cache and store into cache.
>
> Rephrasing the idea:
> when the drivers do napi_alloc_skb() they don't really
> need initialized 'struct sk_buff'. They either need skb->data
> to copy headers into or shinfo->frags to add a page to,
> the full init can wait till eth_type_trans_and_init()
> right before napi_gro_receive().
> Thoughts?

I actually like this idea better than my own: delaying the memset init
of the SKB.  (I didn't understand it until after you rephrased it ;-))

The drivers also need some extra fields, like the hash and some flags, but
we could easily reorganize sk_buff to account for this.

I also like it because, by delaying the init until napi_gro_receive, we
might have a chance to avoid the clearing entirely, e.g. if GRO can merge
the SKB, or for other fastpath forwarding schemes we have not invented yet.
(Thinking out loud: maybe even RPS could delay the memset until the SKB
reaches the remote CPU, saving some cache-line transfers between the CPUs.)
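To make the deferred-init flow concrete, here is a toy userspace sketch.
Everything here is hypothetical: toy_skb stands in for struct sk_buff, and
eth_type_trans_and_init() is the helper name floated above, not an
existing kernel function.  The driver only needs skb->data valid at alloc
time; the big clear happens right before the stack sees the packet, when
skb->data is hot anyway for reading the Ethernet header:

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct sk_buff: 'data' must be valid immediately
 * (drivers copy headers into it or attach frags), while the rest may
 * stay uninitialized until just before the stack sees it. */
struct toy_skb {
	unsigned char *head;
	unsigned char *data;
	unsigned int len;
	unsigned int protocol;
	unsigned char scratch[128]; /* fields normally cleared by memset */
};

/* Driver-side alloc: hand out the skb with only the data pointer set;
 * skip the big memset entirely at this point. */
static void toy_napi_alloc_skb(struct toy_skb *skb, unsigned char *frag)
{
	skb->data = frag;
}

/* Hypothetical eth_type_trans_and_init(): do the deferred clear and
 * basic init right before napi_gro_receive() would be called. */
static void toy_eth_type_trans_and_init(struct toy_skb *skb, unsigned int len)
{
	unsigned char *data = skb->data; /* preserve across the clear */

	memset(skb, 0, sizeof(*skb));
	skb->head = data;
	skb->data = data;
	skb->len = len;
	skb->protocol = (data[12] << 8) | data[13]; /* EtherType */
}
```

The interesting property is that if a fastpath (GRO merge, forwarding)
consumes the packet before toy_eth_type_trans_and_init() runs, the memset
is never paid at all.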

I think I like your delayed-memset idea the best.  Let's see if others
shoot it down ;-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Thread overview: 35+ messages
2015-10-23 12:46 [PATCH 0/4] net: mitigating kmem_cache slowpath for network stack in NAPI context Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 1/4] net: bulk free infrastructure for NAPI context, use napi_consume_skb Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 2/4] net: bulk free SKBs that were delay free'ed due to IRQ context Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 3/4] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 4/4] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
2015-10-27  1:09 ` [PATCH 0/4] net: mitigating kmem_cache slowpath for network stack " David Miller
2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
2016-02-02 21:11   ` [net-next PATCH 01/11] net: bulk free infrastructure for NAPI context, use napi_consume_skb Jesper Dangaard Brouer
2016-02-02 21:11   ` [net-next PATCH 02/11] net: bulk free SKBs that were delay free'ed due to IRQ context Jesper Dangaard Brouer
2016-02-02 21:11   ` [net-next PATCH 03/11] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2016-02-02 21:12   ` [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
2016-02-03  0:52     ` Alexei Starovoitov
2016-02-03 10:38       ` Jesper Dangaard Brouer [this message]
2016-02-02 21:12   ` [net-next PATCH 05/11] mlx5: use napi_*_skb APIs to get bulk alloc and free Jesper Dangaard Brouer
2016-02-02 21:13   ` [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack Jesper Dangaard Brouer
2016-02-09 11:57     ` Saeed Mahameed
2016-02-10 20:26       ` Jesper Dangaard Brouer
2016-02-16  0:01         ` Saeed Mahameed
2016-02-02 21:13   ` [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases Jesper Dangaard Brouer
2016-02-02 22:29     ` kbuild test robot
2016-02-02 21:14   ` [net-next PATCH 08/11] mlx5: hint the NAPI alloc skb API about the expected bulk size Jesper Dangaard Brouer
2016-02-02 21:14   ` [net-next PATCH 09/11] RFC: dummy: bulk free SKBs Jesper Dangaard Brouer
2016-02-02 21:15   ` [net-next PATCH 10/11] RFC: net: API for RX handover of multiple SKBs to stack Jesper Dangaard Brouer
2016-02-02 21:15   ` [net-next PATCH 11/11] RFC: net: RPS bulk enqueue to backlog Jesper Dangaard Brouer
2016-02-07 19:25   ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches David Miller
2016-02-08 12:14     ` [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath Jesper Dangaard Brouer
2016-02-08 12:14       ` [net-next PATCH V2 1/3] net: bulk free infrastructure for NAPI context, use napi_consume_skb Jesper Dangaard Brouer
2016-02-08 12:15       ` [net-next PATCH V2 2/3] net: bulk free SKBs that were delay free'ed due to IRQ context Jesper Dangaard Brouer
2016-02-08 12:15       ` [net-next PATCH V2 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2016-02-11 16:59       ` [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath David Miller
2016-02-13 11:12       ` Tilman Schmidt
