From: Jesper Dangaard Brouer <brouer@redhat.com>
To: David Ahern <dsahern@gmail.com>,
	Frey Alfredsson <freysteinn@freysteinn.com>
Cc: brouer@redhat.com,
	"Maciej Fijalkowski" <maciej.fijalkowski@intel.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	alardam@gmail.com, magnus.karlsson@intel.com,
	bjorn.topel@intel.com, andrii.nakryiko@gmail.com,
	kuba@kernel.org, ast@kernel.org, netdev@vger.kernel.org,
	davem@davemloft.net, hawk@kernel.org, jonathan.lemon@gmail.com,
	bpf@vger.kernel.org, jeffrey.t.kirsher@intel.com,
	maciejromanfijalkowski@gmail.com,
	intel-wired-lan@lists.osuosl.org,
	"Marek Majtyka" <marekx.majtyka@intel.com>,
	"Michael S. Tsirkin" <mst@redhat.com>
Subject: Explaining XDP redirect bulk size design (Was: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set)
Date: Thu, 10 Dec 2020 14:32:11 +0100
Message-ID: <20201210143211.2490f7f4@carbon>
In-Reply-To: <6913010d-2fd6-6713-94e9-8f5b8ad4b708@gmail.com>

On Wed, 9 Dec 2020 08:44:33 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> > (up to 16), so it should not be a problem to solve this by sharing a
> > TX-queue and taking a lock per 16 packets.  I still recommend that,
> > for the fallback case, you allocate a number of TX-queues and
> > distribute them across CPUs to avoid hitting a congested lock (the
> > above measurements are the optimal non-congested atomic lock operation)
> 
> I have been meaning to ask you why 16 for the XDP batching? If the
> netdev budget is 64, why not something higher like 32 or 64?

Thank you for asking, as there are multiple good reasons and
considerations behind this batch size of 16.  Notice that cpumap has a
batch size of 8, which is also an explicit choice.  And AF_XDP went in
the wrong direction IMHO and, I believe, uses 256.  I designed this to
be a choice in the map code, so each map type can pick the level of
bulking it needs/wants.
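
To illustrate the per-map choice, here is a minimal standalone C sketch
of the devmap-style bulk queue.  Names like bulk_queue and
driver_xdp_xmit are illustrative stand-ins, not the actual kernel code;
the real logic lives in kernel/bpf/devmap.c and kernel/bpf/cpumap.c:

  #define DEV_MAP_BULK_SIZE 16  /* devmap bulk toward ndo_xdp_xmit() */
  #define CPU_MAP_BULK_SIZE  8  /* cpumap bulk toward the ptr_ring   */

  struct xdp_frame;

  /* Hypothetical driver hook standing in for ndo_xdp_xmit(). */
  void driver_xdp_xmit(struct xdp_frame **frames, unsigned int n);

  struct bulk_queue {
          struct xdp_frame *q[DEV_MAP_BULK_SIZE];
          unsigned int count;
  };

  /* Enqueue one redirected frame; flush when the 16-slot bulk is
   * full.  Any remaining partial bulk is flushed at NAPI-end via
   * xdp_do_flush(), as described below.
   */
  static void bq_enqueue(struct bulk_queue *bq, struct xdp_frame *frame)
  {
          if (bq->count == DEV_MAP_BULK_SIZE) {
                  driver_xdp_xmit(bq->q, bq->count);
                  bq->count = 0;
          }
          bq->q[bq->count++] = frame;
  }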

The low-level explanation is that these batch sizes of 8 and 16 are
optimized towards cache sizes and Intel's Line Fill Buffer (a
prefetcher with 10 elements).  I'm betting that the memory backing
these 8 or 16 packets has a higher chance of remaining in cache, so I
can prefetch it without evicting it from cache again.  In some cases
the pointers to these packets are queued into a ptr_ring, and it is
more optimal to write 1 cacheline (8 pointers) or 2 cachelines (16
pointers) into the ptr_ring.
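
To spell out the cacheline arithmetic (assuming 64-byte cachelines and
8-byte pointers, as on x86_64):

  #include <assert.h>
  #include <stdio.h>

  int main(void)
  {
          size_t ptr_size = sizeof(void *);  /* 8 bytes on 64-bit */

          assert( 8 * ptr_size ==  64);  /* cpumap batch: 1 cacheline  */
          assert(16 * ptr_size == 128);  /* redirect bulk: 2 cachelines */

          printf("8 ptrs = %zu bytes, 16 ptrs = %zu bytes\n",
                 8 * ptr_size, 16 * ptr_size);
          return 0;
  }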

The general explanation is my goal of doing bulking without adding
latency.  This is explicitly stated in my presentation[1] from Feb
2016, slide 20.  Sure, you/we can likely make the micro-benchmarks
look better by using a batch size of 64, but that will introduce added
latency and likely shoot ourselves in the foot for real workloads.
With experience from bufferbloat and real networks, we know that
massive TX bulking has bad effects.  Still, XDP-redirect does massive
bulking (the NIC flush comes only after the full budget of 64), and we
don't have pushback or a queue mechanism (so I know we are already
shooting ourselves in the foot) ...  Fortunately we now have a PhD
student working on queuing for XDP.

It is also important to understand that this is an adaptive bulking
scheme, which comes from NAPI.  We don't wait for packets that might
arrive shortly; we pick up what the NIC has available.  But by taking
only 8 or 16 packets (instead of emptying the entire RX-queue), and
then spending some time sending them along, I'm hoping the NIC will
have received a few more frames in the meantime.  For cpumap, and veth
in some cases, the consumer can start processing packets from these
batches right away, while NIC drivers get the XDP_XMIT_FLUSH signal at
NAPI-end (xdp_do_flush).  Still, the design allows NIC drivers to
update their internal queue state (and BQL), and if the queue gets
close to full they can choose to flush/doorbell the NIC earlier.  When
doing queuing for XDP we need to expose these NIC queue states, and
having 4 calls with 16 packets each (budget of 64) also gives us more
chances to read the NIC queue state info which the NIC driver already
touches.
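
Putting it together, a simplified pseudo-driver NAPI poll loop; apart
from xdp_do_flush() the names here are illustrative, not actual driver
code:

  #define NAPI_BUDGET   64
  #define XDP_XMIT_BULK 16

  int  rx_poll_one(void);   /* illustrative: process one RX descriptor  */
  void xdp_do_flush(void);  /* core helper: flush pending redirect bulks */

  /* Up to 64 packets per poll round; redirected frames reach the
   * target driver as (up to) 4 bulks of 16 via ndo_xdp_xmit, and the
   * final flush/doorbell happens once at NAPI-end.
   */
  static int napi_poll_sketch(void)
  {
          int work = 0;

          while (work < NAPI_BUDGET && rx_poll_one())
                  work++;   /* XDP_REDIRECT fills the 16-slot bulks */

          xdp_do_flush();   /* triggers XDP_XMIT_FLUSH toward NIC(s) */
          return work;
  }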


[1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

