From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: "David Ahern" <dsahern@gmail.com>,
	"Frey Alfredsson" <freysteinn@freysteinn.com>,
	"Maciej Fijalkowski" <maciejromanfijalkowski@gmail.com>,
	"Andrii Nakryiko" <andrii.nakryiko@gmail.com>,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Network Development" <netdev@vger.kernel.org>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Marek Majtyka" <marekx.majtyka@intel.com>,
	"Marek Majtyka" <alardam@gmail.com>,
	"Jonathan Lemon" <jonathan.lemon@gmail.com>,
	intel-wired-lan <intel-wired-lan@lists.osuosl.org>,
	"Jakub Kicinski" <kuba@kernel.org>, bpf <bpf@vger.kernel.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	brouer@redhat.com
Subject: Re: [Intel-wired-lan] Explaining XDP redirect bulk size design (Was: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set)
Date: Thu, 10 Dec 2020 18:30:23 +0100
Message-ID: <20201210183023.4b299334@carbon>
In-Reply-To: <CAJ8uoz25rtO63-4nOSV-yr8bORNbNSquiBBWiEouLs-ZUv2o=A@mail.gmail.com>

On Thu, 10 Dec 2020 15:14:18 +0100
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Wed, 9 Dec 2020 08:44:33 -0700
> > David Ahern <dsahern@gmail.com> wrote:
> >  
> > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:  
> > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> > > > (up to 16), so it should not be a problem to solve this by sharing a
> > > > TX-queue and taking a lock per 16 packets.  I still recommend that,
> > > > for the fallback case, you allocate a number of TX-queues and
> > > > distribute these across CPUs to avoid hitting a congested lock (the
> > > > above measurements are for the optimal non-congested atomic lock
> > > > operation)
> > >
> > > I have been meaning to ask you why 16 for the XDP batching? If the
> > > netdev budget is 64, why not something higher like 32 or 64?  
> >
> > Thank you for asking, as there are multiple good reasons and
> > considerations behind this batch size of 16.  Notice that cpumap has a
> > batch size of 8, which is also an explicit choice.  And AF_XDP went in
> > the wrong direction IMHO and, I believe, uses 256.  I designed this as
> > a per-map choice in the map code, for the level of bulking each map
> > type needs/wants.
> 
> FYI, as far as I know, there is nothing in AF_XDP that says bulking
> should be 256. There is a 256 number in the i40e driver that states
> the maximum number of packets to be sent within one napi_poll loop.
> But this is just a maximum number and only for that driver. (In case
> you wonder, that number was inherited from the original skb Tx
> implementation in the driver.) 

Ah, that explains the issue I have on the production system that runs
the EDT-pacer[2].  I see that the i40e function i40e_clean_tx_irq()
ignores napi_budget and uses its own budget, which defaults to 256.
It looks like I can adjust this via ethtool -C tx-frames-irq.  I turned
it down to 64 (32 gave worse results, and below 16 the system acted
strangely).
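
For completeness, this is roughly the pattern I mean (a simplified
sketch from memory, not the actual i40e code; the struct and helper
names here are made up for illustration):

  /* Sketch: the TX completion loop runs on its own work limit (the
   * value behind ethtool -C tx-frames-irq, defaulting to 256), and
   * does NOT use the NAPI budget to bound how much it cleans.
   */
  static bool clean_tx_irq_sketch(struct tx_ring_sketch *ring, int napi_budget)
  {
          unsigned int budget = ring->work_limit;   /* tx-frames-irq value */

          do {
                  struct tx_desc_sketch *desc = next_completed_desc(ring);

                  if (!desc)
                          break;               /* nothing more completed */

                  release_buffer(ring, desc);  /* unmap + free the frame */
                  budget--;
          } while (likely(budget));

          /* note: napi_budget is not what bounds this loop */
          return budget != 0;
  }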

Now the issue is gone.  The problem was that if TX-DMA completion
(i40e_clean_tx_irq()) ran on the same CPU that sends packets via the
FQ-pacer qdisc, then the pacing was not accurate and transmission
became too bursty.

The system is already tuned via "net/core/dev_weight" and the RX/TX
bias to reduce bulking, as this can influence latency and the
EDT-pacing accuracy.  (It is a middlebox bridging VLANs, with BPF-EDT
timestamping and FQ-pacing of packets to avoid bursts overflowing
switch ports.)

  sudo sysctl net/core/dev_weight
  net.core.dev_weight = 1
  net.core.dev_weight_rx_bias = 32
  net.core.dev_weight_tx_bias = 1

Setting net.core.dev_weight_tx_bias=1 (together with dev_weight=1)
causes the qdisc transmit budget to become one packet, cycling through
NET_TX_SOFTIRQ, which consumes time and gives the packets a little
more pacing space.
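
My understanding of how these sysctls combine (a back-of-envelope
sketch of the arithmetic, not copied from the kernel sources) is that
the qdisc dequeue quota ends up as dev_weight * dev_weight_tx_bias:

  #include <stdio.h>

  int main(void)
  {
          /* values from the sysctl output above */
          int dev_weight = 1, rx_bias = 32, tx_bias = 1;

          int rx_quota = dev_weight * rx_bias; /* RX/backlog processing quota */
          int tx_quota = dev_weight * tx_bias; /* __qdisc_run() dequeue quota  */

          /* tx_quota == 1: requeue and go through NET_TX_SOFTIRQ after
           * every single packet, which adds a little spacing for pacing */
          printf("rx_quota=%d tx_quota=%d\n", rx_quota, tx_quota);
          return 0;
  }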


> The actual batch size is controlled by
> the application. If it puts 1 packet in the Tx ring and calls send(),
> the batch size will be 1. If it puts 128 packets in the Tx ring and
> calls send(), you get a batch size of 128, and so on. It is flexible,
> so you can trade-off latency with throughput in the way the
> application desires. Rx batch size has also become flexible now with
> the introduction of Björn's prefer_busy_poll patch set [1].
> 
> [1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/
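
Right, and for reference a minimal user-space sketch of what "the
application controls the TX batch size" looks like (xsk_* helper names
as I remember them from libbpf's xsk.h; umem/socket setup and
descriptor filling omitted):

  #include <errno.h>
  #include <sys/socket.h>
  #include <bpf/xsk.h>        /* AF_XDP helpers shipped with libbpf */

  /* The batch size is whatever the application chooses: reserve N
   * slots, fill N descriptors, submit N, then kick the kernel once. */
  static int tx_batch(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                      unsigned int batch)
  {
          __u32 idx;
          unsigned int i;

          if (xsk_ring_prod__reserve(tx, batch, &idx) < batch)
                  return -ENOSPC;              /* TX ring full, try later */

          for (i = 0; i < batch; i++) {
                  struct xdp_desc *d = xsk_ring_prod__tx_desc(tx, idx + i);
                  /* fill d->addr / d->len with the frame to transmit */
          }
          xsk_ring_prod__submit(tx, batch);

          /* this send() is what flushes the whole batch to the driver */
          return sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
  }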

The prefer_busy_poll patch set looks like a cool trick for getting even
more accurate packet scheduling.

I played with the tunings and could see changed behavior with mpstat,
but ended up turning it off again, as I could not measure a direct
effect with the bpftrace tools[3].


> > The low-level explanation is that these batch sizes of 8 and 16 are
> > optimized towards cache sizes and Intel's Line-Fill-Buffer (a
> > prefetcher with 10 elements).  I'm betting that the memory backing
> > these 8 or 16 packets has a higher chance of remaining in cache, and
> > that I can prefetch the packets without evicting them from cache
> > again.  In some cases the pointers to these packets are queued into a
> > ptr_ring, and it is more optimal to write one cacheline (8 pointers)
> > or two cachelines (16 pointers) into the ptr_ring.
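
(For illustration, roughly the pattern the cpumap flush path uses, as I
remember it from kernel/bpf/cpumap.c, simplified and with error
handling dropped: take the producer lock once and write the whole
batch of pointers, i.e. one or two cachelines worth of stores.)

  #include <linux/ptr_ring.h>
  #include <linux/spinlock.h>

  #define BULK_SIZE 8          /* cpumap uses 8; devmap uses 16 */

  struct bulk_queue_sketch {   /* illustrative, not the kernel struct */
          void *q[BULK_SIZE];
          unsigned int count;
  };

  static void bq_flush_sketch(struct ptr_ring *ring,
                              struct bulk_queue_sketch *bq)
  {
          unsigned int i;

          spin_lock(&ring->producer_lock);
          for (i = 0; i < bq->count; i++) {
                  if (__ptr_ring_produce(ring, bq->q[i]))
                          break;   /* ring full; real code drops/frees */
          }
          spin_unlock(&ring->producer_lock);
          bq->count = 0;
  }
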
> >
> > The general explanation is my goal of doing bulking without adding
> > latency.  This is explicitly stated in my presentation[1] from Feb
> > 2016, slide 20.  Sure, you/we can likely make the micro-benchmarks
> > look better by using a batch size of 64, but that will introduce
> > added latency and likely shoot ourselves in the foot for real
> > workloads.  From experience with bufferbloat and real networks, we
> > know that massive TX bulking has bad effects.  Still, XDP-redirect
> > does massive bulking (the NIC flush happens after the full 64 budget)
> > and we don't have pushback or a queue mechanism (so I know we are
> > already shooting ourselves in the foot) ...  Fortunately we now have
> > a PhD student working on queueing for XDP.
> >
> > It is also important to understand that this is an adaptive bulking
> > scheme, which comes from NAPI.  We don't wait for packets that might
> > arrive shortly; we pick up what the NIC has available.  But by only
> > taking 8 or 16 packets (instead of emptying the entire RX-queue), and
> > then spending some time sending them along, I'm hoping the NIC will
> > have received some more frames in the meantime.  For cpumap and veth
> > (in some cases) the consumer can start processing packets from these
> > batches, but NIC drivers only get the XDP_XMIT_FLUSH signal at
> > NAPI-end (xdp_do_flush).  Still, the design allows NIC drivers to
> > update their internal queue state (and BQL), and if the queue gets
> > close to full they can choose to flush/doorbell the NIC earlier.
> > When doing queueing for XDP we need to expose these NIC queue states,
> > and having 4 calls with 16 packets (within the 64 budget) also gives
> > us more chances to get NIC queue state info from cachelines the NIC
> > has already touched.
> >
> >
> > [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
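
To make the adaptive-bulking flow quoted above concrete, a simplified
driver-side sketch (xdp_do_redirect()/xdp_do_flush() are the real
kernel APIs; rx_xdp_prog()/rx_next_frame()/run_xdp_prog() are made up
for illustration):

  static int napi_poll_sketch(struct napi_struct *napi, int budget)
  {
          struct bpf_prog *prog = rx_xdp_prog(napi);   /* hypothetical */
          int work = 0;

          while (work < budget) {               /* budget is usually 64 */
                  struct xdp_buff *xdp = rx_next_frame(napi);

                  if (!xdp)
                          break;
                  if (run_xdp_prog(prog, xdp) == XDP_REDIRECT)
                          xdp_do_redirect(napi->dev, xdp, prog);
                  /* devmap/cpumap internally flush their bulk queues
                   * every 16 (resp. 8) enqueued frames */
                  work++;
          }

          /* target NICs only get the XDP_XMIT_FLUSH / doorbell signal
           * here, once per NAPI poll, i.e. after up to 64 packets */
          xdp_do_flush();
          return work;
  }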

[2] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/

[3] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/bpftrace


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

