* [Intel-wired-lan] [RFC net-next (what uAPI?)] ice: add support for more than 16 RSS queues for VF
From: Przemek Kitszel @ 2024-04-26 13:22 UTC
  To: netdev
  Cc: Paolo Abeni, Jiri Pirko, Samudrala, Sridhar, Ahmed Zaki,
	Pawel Chmielewski, Nguyen,  Anthony L, intel-wired-lan,
	Simon Horman, Mateusz Polchlopek, Jakub Kicinski, Knitter,
	Konrad

The main aim of this RFC is to ask for the preferred uAPI, but let me
start with some background. Then I describe the considered uAPIs, from
most readily available to most implementation-heavy.

## why
We want to extend ice+iavf to support more than 16 RSS-enabled queues
for a VF. To do so we have to assign a bigger RSS LUT than the default
VSI LUT. This feature is known as "Extra Large VF" to users of our OOT
driver. iavf would be notified about changes by ice; it is not touched
in any way by this RFC.

## what
There are 3 types of RSS LUT in the e800 family, and the resource
limits apply to the whole card:
- VSI LUT: 64 entries, allows 16 RSS queues; the smallest, and
  practically unlimited in count;
- GLOBAL LUT: 512 entries, allows 64 RSS queues;
- PF LUT: 2048 entries, allows 256 RSS queues.
(Those sizes could later be used for ethtool -X.) Hardware limits for
the e810 card are 16 GLOBAL LUTs (unused at the moment) and one PF LUT
(assigned to the PF VSI) for each port, which means that the PF LUT
limit grows after a port split.
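
To make the size/queue relationship concrete, a hedged sketch (assuming
a VF that has already been granted a 512-entry GLOBAL LUT; $vf stands
for its netdev):
ethtool -X $vf equal 64   # spread RSS over 64 queues instead of 16
ethtool -x $vf            # inspect the resulting indirection table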

In case this is not yet clear: the PF needs to arbitrate resource
allocation to avoid two users of a given PF LUT. For GLOBAL LUTs the
arbitration could be deferred to FW, but PFs would need to be able to
report usage. It is still possible (and preferred) to arbitrate GLOBAL
LUT distribution in SW; struct ice_adapter could be extended for that.

Please find various interface considerations/proposals below.

## sysfs
Our OOT solution is based on sysfs, with one RW file (attribute) for
each VF and PF. Example flow to reassign PF LUT from PF to VF0:
echo 512 > /sys/class/net/$pf/device/rss_lut_pf_attr # switch to GLOBAL
echo 2048 > /sys/class/net/$pf/device/virtfn0/rss_lut_vf_attr
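For completeness, reading the same attributes back should confirm the
change (assuming the attributes report the current LUT size on read):
cat /sys/class/net/$pf/device/rss_lut_pf_attr          # expect: 512
cat /sys/class/net/$pf/device/virtfn0/rss_lut_vf_attr  # expect: 2048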
Obvious and easy to implement, but it's sysfs, which is cursed.

## ethtool -L $vf
It's tempting to treat all of this Intel RSS LUT business as an
implementation detail that should be transparent to the user, but
anything other than the default flow would then be obscured: not all Rx
queues need RSS; one could also benefit from a bigger table even with
fewer queues; etc.
So I'm opposed.
Any ethtool-based interface would also be the least convenient to
implement, since we would need additional virtchnl communication to
pass the request to the PF.
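
For illustration, the "transparent" flow being rejected here would
presumably be just a channel count change, with the LUT upgrade hidden
inside the driver (hypothetical semantics):
ethtool -L $vf combined 32   # crossing 16 queues would silently force
                             # an upgrade from VSI LUT to GLOBAL LUT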

## devlink - priv params + priv port params
A direct translation of the sysfs interface into the 2020s; it would
require bringing back the unused (and thus removed) port params API.
It would also require registering an entry for each VF, which does not
feel right (keep in mind that the entities we want to distribute are
LUTs). Straightforward to implement.

# for PF:
devlink dev param set DEV name rss_lut_size value { 512 | 2048 } \
		cmode runtime
# for VF:
devlink dev param set DEV/PORT_INDEX name rss_lut_size \
		value { 64 | 512 | 2048 } cmode runtime

(I wonder why there is no "port" keyword in the port (VF) variant
above in the man page:
https://man7.org/linux/man-pages/man8/devlink-port.8.html )

With `show` implemented too.
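
A hedged sketch of the corresponding output (parameter name as
proposed above, value illustrative):
$ devlink dev param show pci/0000:03:00.0 name rss_lut_size
pci/0000:03:00.0:
  name rss_lut_size type driver-specific
    values:
      cmode runtime value 2048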

I don't know if this would require "porting" our "port representors"
to VFs; right now we don't have any devlink interface for VFs.

From the user perspective this is my favorite variant.

## devlink resources (with current API)
`devlink resource` is compelling, partially because the name sounds
like a perfect match. But when we dig just a little, the current
Path+sizes (min, max, step) model is far from the most elegant picture
of the situation. To fit into the existing uAPI, I would need to
register VFs as a PF resource, then GLOBAL LUT and PF LUT as
sub-resources of that (each VF gets two entries under it, plus two
additional ones for the PF). I don't like it, and I also feel there is
not that much use for the current resources API here (it is natural to
use it for limitation, not for distribution).
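
For concreteness, driving that layout with the existing API would
presumably reduce to resizing sub-resources (paths here are made up
for illustration; the current API expresses only sizes, not owners):
# "giving" VF0 a GLOBAL LUT would mean bumping the size of its
# hypothetical sub-resource from 0 to 1, then reloading
devlink resource set pci/$pf path /vfs/vf0/lut_table_512 size 1
devlink dev reload pci/$pf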

## devlink resources (with extended API)
It is possible to extend the current `devlink resource` so that, in
addition to Path+size, there would also be a Path+owner option.
The default state for the ice driver would be that each PF owns its PF
LUT and all GLOBAL LUTs are free.

Example proposed flow to assign a GLOBAL LUT to VF0 and the PF LUT to
VF1:
pf=0000:03:00.0  # likely more meaningful than VSI idx, but open for
vf0=0000:03:00.1 #                                       suggestions
vf1=0000:03:00.2
devlink resource set pci/$pf path /lut/lut_table_512 owner $pf
devlink resource set pci/$pf path /lut/lut_table_2048 owner free
devlink resource set pci/$pf path /lut/lut_table_512 owner $vf0
devlink resource set pci/$pf path /lut/lut_table_2048 owner $vf1

(ASSUMING 1 port)
The above 4 commands would default to transferring ownership of "1
unit"; $pf would still have 17 LUTs under it, and would display 2
GLOBAL LUTs as taken by particular owners, same for the PF LUT, and
finally display 14 GLOBAL LUT units as free.

This option would be best if 16 GLOBAL LUTs were given to each port,
but that is not the case.
(ASSUMPTION off)
The `show` command would have to aggregate over all ports, or simply
not show free GLOBAL LUTs (show only those in use).
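
An imaginary `show` under the owner extension could then look like
this (format invented; free GLOBAL LUTs omitted, as suggested above):
$ devlink resource show pci/$pf
pci/$pf:
  name lut
    resources:
      name lut_table_512 size 2 unit entry owner 0000:03:00.0,0000:03:00.1
      name lut_table_2048 size 1 unit entry owner 0000:03:00.2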

If it were possible to register a "merged device" (== all PFs, or the
physical card), for which we don't have a driver, as an entity for the
devlink resource command, then it would be straightforward to
implement the `show` subcommand for the multi-port case.

## conclusion
I like devlink priv params the most, with devlink resources modified
to have owner semantics as the second option.


* Re: [Intel-wired-lan] [RFC net-next (what uAPI?)] ice: add support for more than 16 RSS queues for VF
From: Jakub Kicinski @ 2024-04-30  1:59 UTC
  To: Przemek Kitszel
  Cc: Pawel Chmielewski, Jiri Pirko, netdev, Knitter, Konrad,
	Ahmed Zaki, Samudrala, Sridhar, intel-wired-lan, Simon Horman,
	Mateusz Polchlopek, Nguyen, Anthony L, Paolo Abeni

On Fri, 26 Apr 2024 15:22:02 +0200 Przemek Kitszel wrote:
> ## devlink resources (with current API)
> `devlink resource` is compelling, partially because the name sounds
> like a perfect match. But when we dig just a little, the current
> Path+sizes (min, max, step) model is far from the most elegant
> picture of the situation. To fit into the existing uAPI, I would need
> to register VFs as a PF resource, then GLOBAL LUT and PF LUT as
> sub-resources of that (each VF gets two entries under it, plus two
> additional ones for the PF). I don't like it, and I also feel there
> is not that much use for the current resources API here (it is
> natural to use it for limitation, not for distribution).

Can you share more on how that would look?

From the description it does not sound so bad. Maybe with some CLI / UI
changes it will be fine?

> ## devlink resources (with extended API)
> It is possible to extend the current `devlink resource` so that, in
> addition to Path+size, there would also be a Path+owner option.
> The default state for the ice driver would be that each PF owns its
> PF LUT and all GLOBAL LUTs are free.
> 
> Example proposed flow to assign a GLOBAL LUT to VF0 and the PF LUT
> to VF1:
> pf=0000:03:00.0  # likely more meaningful than VSI idx, but open for
> vf0=0000:03:00.1 #                                       suggestions
> vf1=0000:03:00.2
> devlink resource set pci/$pf path /lut/lut_table_512 owner $pf
> devlink resource set pci/$pf path /lut/lut_table_2048 owner free
> devlink resource set pci/$pf path /lut/lut_table_512 owner $vf0
> devlink resource set pci/$pf path /lut/lut_table_2048 owner $vf1

Don't we want some level of over-subscription to be allowed?


* Re: [Intel-wired-lan] [RFC net-next (what uAPI?)] ice: add support for more than 16 RSS queues for VF
From: Przemek Kitszel @ 2024-05-06 14:34 UTC
  To: Jakub Kicinski, Jiri Pirko
  Cc: Pawel Chmielewski, netdev, Knitter, Konrad, Ahmed Zaki,
	Samudrala, Sridhar, intel-wired-lan, Simon Horman,
	Mateusz Polchlopek, Nguyen, Anthony L, Paolo Abeni

On 4/30/24 03:59, Jakub Kicinski wrote:
> On Fri, 26 Apr 2024 15:22:02 +0200 Przemek Kitszel wrote:
>> ## devlink resources (with current API)
>> `devlink resource` is compelling, partially because the name sounds
>> like a perfect match. But when we dig just a little, the current
>> Path+sizes (min, max, step) model is far from the most elegant
>> picture of the situation. To fit into the existing uAPI, I would
>> need to register VFs as a PF resource, then GLOBAL LUT and PF LUT
>> as sub-resources of that (each VF gets two entries under it, plus
>> two additional ones for the PF). I don't like it, and I also feel
>> there is not that much use for the current resources API here (it
>> is natural to use it for limitation, not for distribution).
> 
> Can you share more on how that would look?

Something like the output below (though I have added one more layer
to illustrate the broader idea; it could be chopped down):

[1]

> 
> From the description it does not sound so bad. Maybe with some CLI / UI
> changes it will be fine?
> 
>> ## devlink resources (with extended API)
>> It is possible to extend the current `devlink resource` so that, in
>> addition to Path+size, there would also be a Path+owner option.
>> The default state for the ice driver would be that each PF owns its
>> PF LUT and all GLOBAL LUTs are free.
>>
>> Example proposed flow to assign a GLOBAL LUT to VF0 and the PF LUT
>> to VF1:
>> pf=0000:03:00.0  # likely more meaningful than VSI idx, but open for
>> vf0=0000:03:00.1 #                                       suggestions
>> vf1=0000:03:00.2
>> devlink resource set pci/$pf path /lut/lut_table_512 owner $pf
>> devlink resource set pci/$pf path /lut/lut_table_2048 owner free
>> devlink resource set pci/$pf path /lut/lut_table_512 owner $vf0
>> devlink resource set pci/$pf path /lut/lut_table_2048 owner $vf1
> 
> Don't we want some level of over-subscription to be allowed?

In theory we could reuse/share tables, especially in the case of
auto-filled ones. I'm not sure I would want to implement that in the
very first series, but it's a good point to design the interface with
that in mind.
To move in this direction we could make numbered LUTs the entity that
is set/shown (digression: this would feel more like mmap() than
malloc()). (The imaginary output below does not do that.)
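
A hypothetical sketch of that numbered-LUT direction (syntax invented;
a specific GLOBAL LUT is addressed by index, so two functions could
share it; not reflected in the output further below):
devlink resource set pci/$pf path /lut/global/3 owner $vf0
devlink resource set pci/$pf path /lut/global/3 owner $vf1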

The main problem with [1] above for the "current API", for me, is the
lack of an aggregate device [2] among the structures represented by
devlink. Let me show what it would look like with one more layer (so I
put PFs under that, and VFs one layer down).

Here is an example with 2 PFs, one of them with 3 VFs, each with a
different LUT:

$ devlink resource show $device
$device:
   name lut size 4 unit entry
     resources:
       name lut_table_2048 size 2 unit entry size_min 0 size_max 8 size_gran 1
       name lut_table_512 size 2 unit entry size_min 0 size_max 16 size_gran 1
   name functions size 5 unit entry
     resources:
       name pf0
         resources:
           name lut_table_2048 size 0 size_max 1 ...
           name lut_table_512 size 1 size_max 1 ...
           name vf0
             resources:
               name lut_table_2048 size 1 size_max 1 ...
               name lut_table_512 size 0 size_max 1 ...
           name vf1
             resources:
               name lut_table_2048 size 0 size_max 1 ...
               name lut_table_512 size 1 size_max 1 ...
           name vf2
             resources:
               name lut_table_2048 size 0 size_max 1 ...
               name lut_table_512 size 0 size_max 1 ...
       name pf1
         resources:
           name lut_table_2048 size 1 ...
           name lut_table_512 size 0 ...

where $device stands for the aggregate device (covering 8 PFs in case
an 8-port split is used), and all named PFs/VFs in the output would
need "dummy" size calculations (I would like to remove this need
though).

When all resources are assigned, all "size 0" entries should also have
"size_max 0" to indicate there is no room for growth.

[2] aggregate device discussion:
https://lore.kernel.org/intel-wired-lan/cfe84890-f999-4b97-a012-6f9fd16da8e3@intel.com/

