From: Jakub Kicinski <kuba@kernel.org>
To: David Ahern <dsahern@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>,
	netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Magnus Karlsson <magnus.karlsson@intel.com>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	sdf@google.com, Willem de Bruijn <willemb@google.com>,
	Kaiyuan Zhang <kaiyuanz@google.com>
Subject: Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
Date: Tue, 15 Aug 2023 17:16:38 -0700
Message-ID: <20230815171638.4c057dcd@kernel.org>
In-Reply-To: <7dd4f5b0-0edf-391b-c8b4-3fa82046ab7c@kernel.org>

On Sun, 13 Aug 2023 19:10:35 -0600 David Ahern wrote:
> Also, this suggests that the Rx queue is unique to the flow. I do not
> recall a netdev API to create H/W queues on the fly (only a passing
> comment from Kuba), so how is the H/W queue (or queue set since a
> completion queue is needed as well) created for the flow?
> And in turn if it is unique to the flow, what deletes the queue if
> an app does not do a proper cleanup? If the queue sticks around,
> the dmabuf references stick around.

Let's start sketching out the design for queue config.
Without sliding into scope creep, hopefully.

Step one - I think we can decompose the problem into:
 A) flow steering
 B) object lifetime and permissions
 C) queue configuration (incl. potentially creating / destroying queues)

These come together into use scenarios like:
 #1 - partitioning for containers - when high perf containers share
      a machine, each should get an RSS context on the physical NIC
      to have predictable traffic<>CPU placement. They may also have
      different preferences on how the queues are configured, maybe
      XDP, too?
 #2 - fancy page pools within the host (e.g. huge pages)
 #3 - very fancy page pools not within the host (Mina's work)
 #4 - XDP redirect target (allowing XDP_REDIRECT without installing XDP
      on the target)
 #5 - busy polling - admittedly a bit theoretical, I don't know of
      anyone busy polling in real life, but one of the problems today
      is that setting it up requires scraping random bits of info from
      sysfs and a lot of hoping.

Flow steering (A) is there today, to a sufficient extent, I think,
so we can defer on that. Sooner or later we should probably figure
out if we want to continue down the unruly path of TC offloads or
just give up and beef up ethtool.

I don't have a good sense of what a good model for cleanup and
permissions is (B). All I know is that if we need to tie things to
processes netlink can do it, and we shouldn't have to create our
own FS and special file descriptors...

And then there's (C) which is the main part to talk about.
The first step IMHO is to straighten out the configuration process.
Currently we do:

 user -> thin ethtool API --------------------> driver
                              netdev core <---'

By "straighten" I mean more of a:

 user -> thin ethtool API ---> netdev core ---> driver

flow. This means the core maintains the full expected configuration
(queue count and per-queue parameters) and the driver creates those
queues as instructed.

I'd imagine we'd need 4 basic ops:
 - queue_mem_alloc(dev, cfg) -> queue_mem
 - queue_mem_free(dev, cfg, queue_mem)
 - queue_start(dev, queue info, cfg, queue_mem) -> errno
 - queue_stop(dev, queue info, cfg)

The mem_alloc/mem_free pair takes care of the commonly missed
requirement not to take the datapath down until resources for the new
config have been allocated.
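
To make the ordering concrete, here is a minimal user-space sketch in C
of how a core-driven reconfiguration could use the four ops. Everything
in it is an assumption for illustration: the struct names, the toy_*
driver stubs and core_queue_reconfig() are hypothetical and are not
existing kernel APIs. The point it shows is only that memory for the
new config is allocated while the queue is still running on the old one.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical types, invented for this sketch only. */
struct queue_cfg { unsigned int desc_count; };
struct queue_mem { unsigned int desc_count; void *ring; };
struct fake_dev { int running; unsigned int active_desc; int fail_alloc; };

/* The four ops from the mail, with assumed C signatures. */
struct queue_ops {
	struct queue_mem *(*queue_mem_alloc)(struct fake_dev *dev,
					     const struct queue_cfg *cfg);
	void (*queue_mem_free)(struct fake_dev *dev,
			       const struct queue_cfg *cfg,
			       struct queue_mem *mem);
	int (*queue_start)(struct fake_dev *dev, int idx,
			   const struct queue_cfg *cfg, struct queue_mem *mem);
	void (*queue_stop)(struct fake_dev *dev, int idx,
			   const struct queue_cfg *cfg);
};

/* Toy "driver" side: allocate a ring sized by the config. */
static struct queue_mem *toy_mem_alloc(struct fake_dev *dev,
				       const struct queue_cfg *cfg)
{
	struct queue_mem *mem;

	if (dev->fail_alloc)	/* lets a caller simulate memory pressure */
		return NULL;
	mem = malloc(sizeof(*mem));
	if (!mem)
		return NULL;
	mem->ring = calloc(cfg->desc_count, 16);
	if (!mem->ring) {
		free(mem);
		return NULL;
	}
	mem->desc_count = cfg->desc_count;
	return mem;
}

static void toy_mem_free(struct fake_dev *dev, const struct queue_cfg *cfg,
			 struct queue_mem *mem)
{
	(void)dev; (void)cfg;
	if (!mem)
		return;
	free(mem->ring);
	free(mem);
}

static int toy_start(struct fake_dev *dev, int idx,
		     const struct queue_cfg *cfg, struct queue_mem *mem)
{
	(void)idx; (void)cfg;
	dev->active_desc = mem->desc_count;
	dev->running = 1;
	return 0;
}

static void toy_stop(struct fake_dev *dev, int idx,
		     const struct queue_cfg *cfg)
{
	(void)idx; (void)cfg;
	dev->running = 0;
}

static const struct queue_ops toy_ops = {
	.queue_mem_alloc = toy_mem_alloc,
	.queue_mem_free  = toy_mem_free,
	.queue_start     = toy_start,
	.queue_stop      = toy_stop,
};

/* Core side: allocate first, only then touch the running queue. */
static int core_queue_reconfig(struct fake_dev *dev,
			       const struct queue_ops *ops, int idx,
			       const struct queue_cfg *old_cfg,
			       struct queue_mem **cur_mem,
			       const struct queue_cfg *new_cfg)
{
	struct queue_mem *new_mem;
	int err;

	assert(ops && cur_mem && *cur_mem);

	new_mem = ops->queue_mem_alloc(dev, new_cfg);
	if (!new_mem)
		return -ENOMEM;	/* queue keeps running on the old config */

	ops->queue_stop(dev, idx, old_cfg);
	err = ops->queue_start(dev, idx, new_cfg, new_mem);
	if (err) {
		/* Best-effort rollback; the old memory was never freed. */
		ops->queue_start(dev, idx, old_cfg, *cur_mem);
		ops->queue_mem_free(dev, new_cfg, new_mem);
		return err;
	}

	ops->queue_mem_free(dev, old_cfg, *cur_mem);
	*cur_mem = new_mem;
	return 0;
}
```

If the allocation fails, core_queue_reconfig() returns before calling
queue_stop(), so the datapath is never taken down for a config that
cannot be instantiated.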

Core then sets all the queues up after ndo_open, and tears them down
before ndo_stop. In case of an ethtool -L / -G call or enabling /
disabling XDP, core can handle the entire reconfiguration dance.

The cfg object needs to contain all queue configuration, including
the page pool parameters.

If we have an abstract model of the configuration in the core we can
modify it much more easily, I hope. I mean - the configuration will be
somewhat detached from what's instantiated in the drivers.

I'd prefer to go as far as we can without introducing a driver callback
to "check if it can support a config change", and try to rely on
(static) capabilities instead. This allows more of the validation to
happen in the core and also lends itself naturally to exporting the
capabilities to the user.
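
As a rough illustration of validating against static capabilities in
the core rather than through a driver callback, a sketch in C could
look like the following. The names (struct queue_caps, struct
queue_req, the QCAP_* bits, core_validate_cfg()) are made up for this
example and do not correspond to an existing kernel interface.

```c
#include <errno.h>

/* Hypothetical feature bits a driver could advertise once, at probe. */
#define QCAP_HDS          (1u << 0)	/* header-data split */
#define QCAP_MEM_PROVIDER (1u << 1)	/* custom page pool memory provider */

/* Static capabilities: filled in by the driver, never changed later. */
struct queue_caps {
	unsigned int min_desc;
	unsigned int max_desc;
	unsigned int features;
};

/* What the user asked for, as stored in the core's config model. */
struct queue_req {
	unsigned int desc_count;
	unsigned int features;
};

/* Core-only check: no driver callback is involved. */
static int core_validate_cfg(const struct queue_caps *caps,
			     const struct queue_req *req)
{
	if (req->desc_count < caps->min_desc ||
	    req->desc_count > caps->max_desc)
		return -EINVAL;
	if (req->features & ~caps->features)
		return -EOPNOTSUPP;	/* asked for an unadvertised feature */
	return 0;
}
```

Because the capability struct is plain data, the same object that
drives validation could also be dumped to user space as-is.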

Checking the use cases:

 #1 - partitioning for containers - storing the cfg in the core gives
      us a neat ability to allow users to set the configuration per
      RSS context
 #2, #3 - page pools - we can make page_pool_create take cfg and read
      whatever params we want from there: memory provider, descriptor
      count, recycling ring size etc. Also, for header-data-split we
      may want different settings per queue, so again cfg comes in
      handy
 #4 - XDP redirect target - we should spawn XDP TX queues independently
      from the XDP configuration
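
One way to picture cfg stored in the core with per-RSS-context and
per-queue settings is a layered lookup. The sketch below is only an
assumption about how that could be modeled: the struct names, the
fields and the precedence order (queue, then RSS context, then device
default) are invented here and are not part of the proposal.

```c
#include <stddef.h>

/* Hypothetical subset of the parameters a cfg object might carry. */
struct pp_cfg {
	unsigned int desc_count;
	unsigned int recycle_ring;
	int hds_enabled;	/* header-data split on/off */
};

/* One optional layer of configuration; "set" marks it as present. */
struct cfg_layer {
	int set;
	struct pp_cfg cfg;
};

/*
 * Pick the most specific config that the user has set.
 * Assumed precedence: per-queue, then RSS context, then device default.
 */
static const struct pp_cfg *effective_cfg(const struct cfg_layer *queue,
					  const struct cfg_layer *rss_ctx,
					  const struct pp_cfg *dev_default)
{
	if (queue && queue->set)
		return &queue->cfg;
	if (rss_ctx && rss_ctx->set)
		return &rss_ctx->cfg;
	return dev_default;
}
```

With something like this, the core could hand the result of
effective_cfg() to the driver when a queue is started, and the driver
would not need to know which layer a value came from.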

That's all I have thought up in terms of direction.
Does that make sense? What are the main gaps? Other proposals?
