netdev.vger.kernel.org archive mirror
* [RFC] current devlink extension plan for NICs
@ 2020-03-19 19:27 Jiri Pirko
  2020-03-20  3:32 ` Jakub Kicinski
From: Jiri Pirko @ 2020-03-19 19:27 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta

Hi all.

First, I would like to apologize for the very long email. But I think it
is beneficial to see the whole picture of where we are going.

Currently at Mellanox we are working on several features that need
extensions of the current devlink infrastructure. I took a stab at
putting it all together in a single txt file, inlined below.

Most of the stuff is based on a new object called "slice" (called
"subdev" originally in Yuval's patchsets sent some while ago).

The text describes how things should behave and provides a draft
of user-facing console inputs/outputs. I think it is important to clear
that up before we go in and implement the devlink core and
driver pieces.

I would like to ask you to read this and comment. Especially, I would
like to ask vendors if what is described fits the needs of your
NIC/e-switch.

Please note that some of this is already implemented, but most of it
isn't (see the "what needs to be implemented" section).




==================================================================
||                                                              ||
||            Overall illustration of example setup             ||
||                                                              ||
==================================================================

Note that there are 2 hosts in the picture. Host A may be the smartnic host,
Host B may be one of the hosts which gets a PF. You can also omit
Host B and view Host A as an ordinary NIC in a host.

Note that the PF is merged with the physical port representor.
This allows a simpler and seamless transition from legacy mode and back.
The devlink_ports and netdevs for physical ports persist across
the transition.

                        +-----------+
                        |phys port 2+-----------------------------------+
                        +-----------+                                   |
                        +-----------+                                   |
                        |phys port 1+---------------------------------+ |
                        +-----------+                                 | |
                                                                      | |
+------------------------------------------------------------------+  | |
|  devlink instance for the whole ASIC                   HOST A    |  | |
|                                                                  |  | |
|  pci/0000:06:00.0  (devlink dev)                                 |  | |
|  +->health reporters, params, info, sb, dpipe,                   |  | |
|  |  resource, traps                                              |  | |
|  |                                                               |  | |
|  +-+port_pci/0000:06:00.0/0+-----------------------+-------------|--+ |
|  | |  flavour physical pfnum 0  (phys port and pf) ^             |    |
|  | |  netdev enp6s0f0np1                           |             |    |
|  | +->health reporters, params                     |             |    |
|  | |                                               |             |    |
|  | +->slice_pci/0000:06:00.0/0+--------------------+             |    |
|  |       flavour physical                                        |    |
|  |                                                               |    |
|  +-+port_pci/0000:06:00.0/1+-----------------------+-------------|----+
|  | |  flavour physical pfnum 1  (phys port and pf) |             |
|  | |  netdev enp6s0f0np2                           |             |
|  | +->health reporters, params                     |             |
|  | |                                               |             |
|  | +->slice_pci/0000:06:00.0/1+--------------------+             |
|  |       flavour physical                                        |
|  |                                                               |
|  +-+-+port_pci/0000:06:00.0/2+-----------+-------------------+   |
|  | | |  flavour pcipf pfnum 2            ^                   |   |
|  | | |  netdev enp6s0f0pf2               |                   |   |
|  | + +->params                           |                   |   |
|  | |                                     |                   |   |
|  | +->slice_pci/0000:06:00.0/2+----------+                   |   |
|  |       flavour pcipf                                       |   |
|  |                                                           |   |
|  +-+-+port_pci/0000:06:00.0/3+-----------+----------------+  |   |
|  | | |  flavour pcivf pfnum 2 vfnum 0    ^                |  |   |
|  | | |  netdev enp6s0pf2vf0              |                |  |   |
|  | | +->params                           |                |  |   |
|  | |                                     |                |  |   |
|  | +-+slice_pci/0000:06:00.0/3+----------+                |  |   |
|  |   |   flavour pcivf                                    |  |   |
|  |   +->rate (qos), mpgroup, mac                          |  |   |
|  |                                                        |  |   |
|  +-+-+port_pci/0000:06:00.0/4+-----------+-------------+  |  |   |
|  | | |  flavour pcivf pfnum 0 vfnum 0    ^             |  |  |   |
|  | | |  netdev enp6s0pf0vf0              |             |  |  |   |
|  | | +->params                           |             |  |  |   |
|  | |                                     |             |  |  |   |
|  | +-+slice_pci/0000:06:00.0/4+----------+             |  |  |   |
|  |   |   flavour pcivf                                 |  |  |   |
|  |   +->rate (qos), mpgroup, mac                       |  |  |   |
|  |                                                     |  |  |   |
|  +-+-+port_pci/0000:06:00.0/5+-----------+----------+  |  |  |   |
|  | | |  flavour pcisf pfnum 0 sfnum 1    ^          |  |  |  |   |
|  | | |  netdev enp6s0pf0sf1              |          |  |  |  |   |
|  | | +->params                           |          |  |  |  |   |
|  | |                                     |          |  |  |  |   |
|  | +-+slice_pci/0000:06:00.0/5+----------+          |  |  |  |   |
|  |   |   flavour pcisf                              |  |  |  |   |
|  |   +->rate (qos), mpgroup, mac                    |  |  |  |   |
|  |                                                  |  |  |  |   |
|  +-+slice_pci/0000:06:00.0/6+--------------------+  |  |  |  |   |
|        flavour pcivf pfnum 0 vfnum 1             |  |  |  |  |   |
|  |        (non-ethernet (IB, NVE))               |  |  |  |  |   |
|                                                  |  |  |  |  |   |
+------------------------------------------------------------------+
                                                   |  |  |  |  |
                                                   |  |  |  |  |
                                                   |  |  |  |  |
+----------------------------------------------+   |  |  |  |  |
|  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
|                                              |   |  |  |  |  |
|  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
|  +->health reporters, info                   |   |  |  |  |  |
|  |                                           |   |  |  |  |  |
|  +-+port_pci/0000:10:00.0/1+---------------------------------+
|    |  flavour virtual                        |   |  |  |  |
|    |  netdev enp16s0f0                       |   |  |  |  |
|    +->health reporters                       |   |  |  |  |
|                                              |   |  |  |  |
+----------------------------------------------+   |  |  |  |
                                                   |  |  |  |
+----------------------------------------------+   |  |  |  |
|  devlink instance VF (other host)    HOST B  |   |  |  |  |
|                                              |   |  |  |  |
|  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
|  +->health reporters, info                   |   |  |  |  |
|  |                                           |   |  |  |  |
|  +-+port_pci/0000:10:00.1/1+------------------------------+
|    |  flavour virtual                        |   |  |  |
|    |  netdev enp16s0f0v0                     |   |  |  |
|    +->health reporters                       |   |  |  |
|                                              |   |  |  |
+----------------------------------------------+   |  |  |
                                                   |  |  |
+----------------------------------------------+   |  |  |
|  devlink instance VF                 HOST A  |   |  |  |
|                                              |   |  |  |
|  pci/0000:06:00.1  (devlink dev)             |   |  |  |
|  +->health reporters, info                   |   |  |  |
|  |                                           |   |  |  |
|  +-+port_pci/0000:06:00.1/1+---------------------------+
|    |  flavour virtual                        |   |  |
|    |  netdev enp6s0f0v0                      |   |  |
|    +->health reporters                       |   |  |
|                                              |   |  |
+----------------------------------------------+   |  |
                                                   |  |
+----------------------------------------------+   |  |
|  devlink instance SF                 HOST A  |   |  |
|                                              |   |  |
|  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
|  +->health reporters, info                   |   |  |
|  |                                           |   |  |
|  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
|    |  flavour virtual                        |   |
|    |  netdev enp6s0f0s1                      |   |
|    +->health reporters                       |   |
|                                              |   |
+----------------------------------------------+   |
                                                   |
+----------------------------------------------+   |
|  devlink instance VF                 HOST A  |   |
|                                              |   |
|  pci/0000:06:00.2  (devlink dev)+----------------+
|  +->health reporters, info                   |
|                                              |
+----------------------------------------------+




==================================================================
||                                                              ||
||                 what needs to be implemented                 ||
||                                                              ||
==================================================================

1) physical port "pfnum". When the PF and physical port representor
   are merged, the devlink_port instance representing the physical port
   and PF needs to have a "pfnum" attribute to stay in sync
   with the other PF port representors.

2) per-port health reporters are not implemented yet.

3) devlink_port instance in PF/VF/SF flavour "virtual". In a PF/VF/SF devlink
   instance (in a VM, for example), it would make sense to have a devlink_port
   instance. At least to carry a link to the netdevice name (otherwise we have
   no easy way to find out which devlink instance and netdevice belong to
   each other). Regarding the flavour name, we have to distinguish it from
   the e-switch devlink port flavours "pcipf", "pcivf", "pcisf".

   This was recently implemented by Parav:
commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
Merge: c04d102ba56e 162add8cbae4
Author: David S. Miller <davem@davemloft.net>
Date:   Tue Mar 3 15:40:40 2020 -0800

    Merge branch 'devlink-virtual-port'

   What is missing is the "virtual" flavour for nested PF.

4) slice is not implemented yet. This is the original "vdev/subdev" concept.
   See below section "Slice user cmdline API draft".

5) SFs are not implemented.
   See below section "SF (subfunction) user cmdline API draft".

6) rate for slices is not implemented yet.
   See below section "Slice rate user cmdline API draft".

7) mpgroup for slice is not implemented yet.
   See below section "Slice mpgroup user cmdline API draft".

8) VF manual creation using devlink is not implemented yet.
   See below section "VF manual creation and activation user cmdline API draft".
 
9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
   merged e-switch.



==================================================================
||                                                              ||
||                  Issues/open questions                       ||
||                                                              ||
==================================================================

1) "pfnum" has to be per-asic(/devlink instance), not per-host.
   That means that in smartNIC scenario, we cannot have "pfnum 0"
   for smartNIC and "pfnum 0" for host as well.
   
2) Q: for TX/RX queue reporters, should they be bound to devlink_port?
   For which flavours might this make sense?
   Most probably for flavours "physical"/"virtual".
   How about the reporters in VF/SF?

3) How is the management of a nested switch handled? The PFs created
   dynamically, or the ones in hosts in the smartnic scenario, may
   themselves each be a manager of a nested e-switch. How to toggle this
   capability?
   During creation by a cmdline option?
   During lifetime, in case the PF does not have any children (VFs/SFs)?



==================================================================
||                                                              ||
||                Slice user cmdline API draft                  ||
||                                                              ||
==================================================================

Note that some of the "devlink port" attributes may be forgotten or misordered.

Slices and ports are created together by the device driver. The driver
defines the relationships during creation.


$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2

In these 2 outputs you can see the relationships. Attributes "slice" and "port"
indicate the slice-port pairs.

Also, there is a fixed "state" attribute with value "active". This is the
default, as VFs are always created active. In the future, it is planned
to implement manual VF creation and activation, similar to what is
described below for SFs.

Note that the non-ethernet slice (the last one) does not have any
related port. It can be, for example, NVE or IB. But since
the "hw_addr" attribute is also omitted, it isn't IB.

 
Now set a different MAC address for VF1 on PF0:
$ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2



==================================================================
||                                                              ||
||          SF (subfunction) user cmdline API draft             ||
||                                                              ||
==================================================================

Note that some of the "devlink port" attributes may be forgotten or misordered.

Note that some of the "devlink slice" attributes in show commands
are omitted on purpose.

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66

There is one VF on the NIC.


Now create a subfunction on PF1 with sfnum 10. The index of the slice is
going to be 100 and the hw_addr aa:bb:cc:aa:bb:cc.

$ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc

The devlink kernel code calls down to the device driver (devlink op) and asks
it to create a slice with particular attributes. The driver then instantiates
the slice and port in the same way it is done for a VF:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive

Note that the SF slice is created but not active. That means the
entities are created on the devlink side and the e-switch port representor
is created, but the SF device itself is not yet out there (same host
or different, depends on where the parent PF is - in this case the same host).
The user might use the e-switch port representor enp6s0pf1sf10 to do settings,
put it into a bridge, add TC rules, etc.
It's like the cable is unplugged on the other side.
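The add/activate split described above can be sketched as a small userspace model. This is only an illustration for discussion: the class and attribute names (DevlinkInstance, slice_add, sf_devices, ...) are hypothetical, not the actual devlink kernel API; the representor naming follows the examples in this text.

```python
# Hypothetical model of the proposed slice lifecycle (not kernel API).

class Slice:
    def __init__(self, index, flavour, pfnum, sfnum, hw_addr):
        self.index = index
        self.flavour = flavour
        self.pfnum = pfnum
        self.sfnum = sfnum
        self.hw_addr = hw_addr
        self.state = "inactive"      # SF slices start inactive, unlike VFs

class DevlinkInstance:
    def __init__(self, handle):
        self.handle = handle
        self.slices = {}             # slice index -> Slice
        self.representors = {}       # slice index -> representor netdev name
        self.sf_devices = set()      # slice indices with the SF device deployed

    def slice_add(self, index, flavour, pfnum, sfnum, hw_addr):
        # "devlink slice add": the slice and its e-switch representor
        # are created right away, but the SF device is not deployed yet
        # ("cable unplugged").
        sl = Slice(index, flavour, pfnum, sfnum, hw_addr)
        self.slices[index] = sl
        self.representors[index] = f"enp6s0pf{pfnum}sf{sfnum}"
        return sl

    def slice_set_state(self, index, state):
        # "devlink slice set ... state active/inactive": only now does
        # the driver ask the device to (de)instantiate the SF device.
        sl = self.slices[index]
        sl.state = state
        if state == "active":
            self.sf_devices.add(index)
        else:
            self.sf_devices.discard(index)
```

Mirroring the walkthrough: after `slice_add(100, "pcisf", 1, 10, ...)` the representor `enp6s0pf1sf10` exists while `sf_devices` is still empty; only `slice_set_state(100, "active")` deploys the SF device.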


Now we activate (deploy) the SF:
$ devlink slice set pci/0000:06:00.0/100 state active

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active

Upon the activation, the device driver asks the device to instantiate
the actual SF device on the particular PF. It does not matter whether
that is on the same host or not.

On the other side, the PF driver instance gets the event from the device
that the particular SF was activated. That is the cue to put the device on
the bus, probe it, and instantiate netdev and devlink instances for it.

For every SF, a device is created on virtbus with an ID assigned by the
virtbus code. For example:
/sys/bus/virtbus/devices/mlx5_sf.1

$ cat /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
10

/sys/bus/virtbus/devices/mlx5_sf.1 is a symlink to:
../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_sf.1

The new devlink instance is named using an alias:
$ devlink dev show
pci/0000:06:00.0%sf10

$ devlink port show
pci/0000:06:00.0%sf10/0: flavour virtual type eth netdev enp6s0f0s10

You can see that udev used the sysfs files and the symlink to assemble the
netdev name.

Note that this kind of aliasing is not implemented yet. It needs to be done
in the devlink code in the kernel. During SF devlink instance creation, the
parent PF device pointer and sfnum should be passed, from which the alias
dev_name is assembled. This ensures persistent naming, consistent in both
the smartnic and the host use cases.
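The naming rule proposed above can be written down as two tiny helpers. The output formats are taken from the examples in this text (alias `pci/0000:06:00.0%sf10`, netdev `enp6s0f0s10`); the function names are made up for illustration:

```python
def sf_devlink_alias(pf_handle, sfnum):
    """Alias for the SF devlink instance, assembled from the parent PF
    devlink handle and the sfnum, e.g. "pci/0000:06:00.0%sf10"."""
    return f"{pf_handle}%sf{sfnum}"

def sf_netdev_name(pf_netdev_base, sfnum):
    """Netdev name udev would assemble from the parent PF netdev base
    name and the sfnum read from sysfs, e.g. "enp6s0f0s10"."""
    return f"{pf_netdev_base}s{sfnum}"
```

Because both names are derived only from the parent PF identity and the sfnum, they come out the same regardless of virtbus device IDs, which is what makes the naming persistent.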

If the user on the smartnic or host does not want the virtbus device to get
probed automatically (for any reason), they can disable it by:

$ echo "0" > /sys/bus/virtbus/drivers_autoprobe

This is enabled by default.


User can deactivate the slice by:

$ devlink slice set pci/0000:06:00.0/100 state inactive

This eventually leads to an event delivered to the PF driver, which is the
cue to remove the SF device from virtbus and remove all related devlink
and netdev instances.

The slice may be activated again.

Now for the teardown process: the user might remove the SF slice
right away, without deactivation. However, it is possible to remove
a deactivated SF too. To remove the SF, the user should do:

$ devlink slice del pci/0000:06:00.0/100

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66



==================================================================
||                                                              ||
||   VF manual creation and activation user cmdline API draft   ||
||                                                              ||
==================================================================

To enter manual mode, the user has to turn off dummy VF creation:
$ devlink dev set pci/0000:06:00.0 vf_dummies disabled
$ devlink dev show
pci/0000:06:00.0: vf_dummies disabled

It is "enabled" by default in order not to break existing users.

By setting the "vf_dummies" attribute to "disabled", the driver
removes all dummy VFs. Only physical ports are present:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2

Then the user is able to create them in a similar way to SFs:

$ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc

The devlink kernel code calls down to the device driver (devlink op) and asks
it to create a slice with particular attributes. The driver then instantiates
the slice and port:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf8 slice 99

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive

Now we activate (deploy) the VF:
$ devlink slice set pci/0000:06:00.0/99 state active

$ devlink slice show
pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active



==================================================================
||                                                              ||
||                             PFs                              ||
||                                                              ||
==================================================================

There are 2 flavours of PFs:
1) Parent PF. That is coupled with the uplink port. The slice flavour is
   therefore "physical", to be in sync with the flavour of the uplink port.
   In case this Parent PF is actually a leg of an upstream embedded switch,
   the slice flavour is "virtual" (same as the port flavour).

   $ devlink port show
   pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0

   $ devlink slice show
   pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active

   This slice is shown in both "switchdev" and "legacy" modes.

   If there is another parent PF, say "0000:06:00.1", that shares the
   same embedded switch, aliasing is established for the devlink handles.

   The user can use devlink handles:
   pci/0000:06:00.0
   pci/0000:06:00.1
   as equivalents, pointing to the same devlink instance.

   Parent PFs are the ones that may be in control of managing
   embedded switch, on any hierarchy level.

2) Child PF. This is a leg of a PF plugged into the parent PF's e-switch.
   It is represented by a slice, and a port (with a netdevice):

   $ devlink port show
   pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
   pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20

   $ devlink slice show
   pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
   pci/0000:06:00.0/20: flavour pcipf pfnum 2 port 1 hw_addr aa:bb:cc:aa:bb:87 state active

   This is a typical smartnic scenario. You would see this list on
   the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
   one of the hosts. If you send packets to enp6s0f0pf2, they will
   go to the host.

   Note that inside the host, the PF is represented again as "Parent PF"
   and may be used to configure nested embedded switch.
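The handle aliasing mentioned for parent PFs sharing one merged e-switch can be modeled as multiple handles resolving to the very same instance. A minimal sketch, assuming a simple registry (the function names are illustrative; in the kernel this would live in the devlink handle lookup):

```python
# Sketch: aliased parent PF handles resolve to one shared devlink instance.
_instances = {}

def register_alias(handles, instance):
    # Every aliased handle points at the same instance object.
    for handle in handles:
        _instances[handle] = instance

def lookup(handle):
    return _instances[handle]

# Two parent PFs sharing one merged e-switch, per the example handles above.
eswitch = {"name": "merged e-switch"}
register_alias(["pci/0000:06:00.0", "pci/0000:06:00.1"], eswitch)
```

Looking up either handle returns the identical object, which is exactly the "equivalents, pointing to the same devlink instance" behaviour described above.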



==================================================================
||                                                              ||
||               Slice operational state extension              ||
||                                                              ||
==================================================================

In addition to the "state" attribute that serves the purpose
of setting the "admin state", an "opstate" attribute is added to
reflect the operational state of the slice:


    opstate                description
    -----------            ------------
    1. attached    State when the slice device is bound to the host
                   driver. When the slice device is unbound from the
                   host driver, it exits this state and enters the
                   detaching state.

    2. detaching   State when the host is notified to deactivate the
                   slice device, which may be undergoing detachment
                   from the host driver. When the slice device is
                   fully detached from the host driver, the slice
                   exits this state and enters the detached state.

    3. detached    State when the slice device is fully unbound from
                   the host driver.

slice state machine:
--------------------
                               slice state set inactive
                              ----<------------------<---
                              | or  slice delete        |
                              |                         |
  __________              ____|_______              ____|_______
 |          | slice add  |            |slice state |            |
 |          |-------->---|            |------>-----|            |
 | invalid  |            |  inactive  | set active |   active   |
 |          | slice del  |            |            |            |
 |__________|--<---------|____________|            |____________|

slice device operational state machine:
---------------------------------------
  __________                ____________              ___________
 |          | slice state  |            |driver bus  |           |
 | invalid  |-------->-----|  detached  |------>-----| attached  |
 |          | set active   |            | probe()    |           |
 |__________|              |____________|            |___________|
                                 |                        |
                                 ^                    slice state
                                 |                    set inactive
                            successful detach             |
                              or pf reset                 |
                             ____|_______                 |
                            |            | driver bus     |
                 -----------| detaching  |---<-------------
                 |          |            | remove()
                 ^          |____________|
                 |   timeout      |
                 --<---------------
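The two diagrams above can be encoded as transition tables; anything not listed is an illegal transition. This is only a model for discussion: the event strings are paraphrased from the diagrams, and the "successful detach / pf reset / timeout" edges are folded into a single "detach complete" event.

```python
# Admin ("state") machine, per the first diagram.
ADMIN = {
    ("invalid", "slice add"): "inactive",
    ("inactive", "state set active"): "active",
    ("active", "state set inactive"): "inactive",
    ("inactive", "slice del"): "invalid",
}

# Operational ("opstate") machine, per the second diagram.
# "detach complete" stands for successful detach, pf reset or timeout.
OPSTATE = {
    ("invalid", "state set active"): "detached",
    ("detached", "driver bus probe()"): "attached",
    ("attached", "state set inactive"): "detaching",
    ("detaching", "detach complete"): "detached",
}

def step(table, state, event):
    """Return the next state, or raise for an illegal transition."""
    try:
        return table[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}")
```

For example, an SF goes invalid -> inactive on `slice add`, inactive -> active on `state set active`; on the opstate side it only reaches attached after the driver bus probe() succeeds.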



==================================================================
||                                                              ||
||             Slice rate user cmdline API draft                ||
||                                                              ||
==================================================================

Note that some of the "devlink slice" attributes in show commands
are omitted on purpose.


$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0
pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1

The slice object is extended with a new rate object.


$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf

This shows the leaves created by default alongside the slice objects. No min or
max tx rates were set, so their values are omitted.


Now create a new node rate object:

$ devlink slice rate add pci/0000:06:00.0/somerategroup

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node

A new node rate object was created - the last line.


Now create another node object, this time with some attributes:

$ devlink slice rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000

Another node object was created - the last line. The object has min and max
tx rates set, so they are displayed after the object type.


Now set the min/max rate of the node named somerategroup using the rate object:

$ devlink slice rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set a leaf slice rate using the rate object:

$ devlink slice rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set the parent node of the leaf slice with index 2 using the rate object:

$ devlink slice rate set pci/0000:06:00.0/2 parent somerategroup

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set the parent node of the leaf slice with index 1 using the rate object:

$ devlink slice rate set pci/0000:06:00.0/1 parent somerategroup

$ devlink slice rate
pci/0000:06:00.0/1: type leaf parent somerategroup
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now unset the parent node of the leaf slice with index 1 using the rate object:

$ devlink slice rate set pci/0000:06:00.0/1 noparent

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now delete the node object:

$ devlink slice rate del pci/0000:06:00.0/somerategroup

$ devlink slice rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000

The rate node object was removed and its only child, pci/0000:06:00.0/2, was
automatically detached. Note that secondrategroup still exists, as only
somerategroup was deleted.
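
To script against this draft API, the plain-text output can be parsed with
standard tools. Below is a minimal sketch, assuming the output format drafted
above; `children_of` is a hypothetical helper, not part of devlink:

```shell
# Sample "devlink slice rate" output in the draft format above; on a real
# setup it would be captured with: rate_output="$(devlink slice rate)".
rate_output='pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000'

# Print the handles of all leaves attached to the given rate node.
children_of() {
    printf '%s\n' "$rate_output" |
        awk -v node="$1" '$0 ~ ("parent " node "$") { sub(/:$/, "", $1); print $1 }'
}

children_of somerategroup
```

Such a helper would only be needed until the new objects are covered by
devlink's JSON output (-j), at which point jq would be the natural tool.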



==================================================================
||                                                              ||
||            Slice IB grouping user cmdline API draft          ||
||                                                              ||
==================================================================

Note that some of the "devlink slice" attributes in show commands
are omitted on purpose.

The reason for this IB grouping is that the VFs inside a virtual machine
get information (via the device) about which two or more VF devices should
be combined together to form one multi-port IB device. In the virtual
machine, it is the driver's responsibility to set up the combined
multi-port IB devices.

Consider the following setup:

$ devlink slice show
pci/0000:06:00.0/0: flavour physical pfnum 0
pci/0000:06:00.0/1: flavour physical pfnum 1
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0
pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1


Each VF/SF slice has an IB leaf object related to it:

$ devlink slice ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf

You see that by default, each slice is marked as a leaf.
There is no grouping set.


The user may add an IB group node by issuing the following command:

$ devlink slice ib add pci/0000:06:00.0/someibgroup1

$ devlink slice ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node

A new IB node object was created (the last line).


Now set the parent node of the leaf slice with index 2 using the IB object:

$ devlink slice ib set pci/0000:06:00.0/2 parent someibgroup1

$ devlink slice ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node


Now set the parent node of the leaf slice with index 5 using the IB object:

$ devlink slice ib set pci/0000:06:00.0/5 parent someibgroup1

$ devlink slice ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf parent someibgroup1
pci/0000:06:00.0/someibgroup1: type node

Now you can see there are 2 leaf devices configured to have one parent.


To remove the parent link, the user should issue the following command:

$ devlink slice ib set pci/0000:06:00.0/5 noparent

$ devlink slice ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node


Now delete the node object:

$ devlink slice ib del pci/0000:06:00.0/someibgroup1
$ devlink slice ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf

The node object was removed and its only child, pci/0000:06:00.0/2, was
automatically detached.


It is not possible to delete leaves:

$ devlink slice ib del pci/0000:06:00.0/2
devlink answers: Operation not supported

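
The grouping steps above lend themselves to scripting. Below is a minimal
sketch, assuming the draft "slice ib" subcommands above; `ib_group_cmds` is a
hypothetical helper that only emits the command sequence (pipe it to sh, or
drop the echo, on a real setup):

```shell
# Emit the devlink commands that group the given leaf slice indices
# under one IB node.
ib_group_cmds() {
    dev="$1"; node="$2"; shift 2
    echo "devlink slice ib add $dev/$node"
    for idx in "$@"; do
        echo "devlink slice ib set $dev/$idx parent $node"
    done
}

ib_group_cmds pci/0000:06:00.0 someibgroup1 2 5
```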


==================================================================
||                                                              ||
||            Dynamic PFs user cmdline API draft                ||
||                                                              ||
==================================================================

The user might want to create another PF, similarly to a VF.
TODO

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-19 19:27 [RFC] current devlink extension plan for NICs Jiri Pirko
@ 2020-03-20  3:32 ` Jakub Kicinski
  2020-03-20  7:35   ` Jiri Pirko
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-20  3:32 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta

On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:
> Hi all.
> 
> First, I would like to apologize for very long email. But I think it
> would be beneficial to the see the whole picture, where we are going.
> 
> Currently in Mellanox we are working on several features with need of
> extension of the current devlink infrastructure. I took a stab at
> putting it all together in a single txt file, inlined below.

Does it make sense to transcribe this (modulo discussion) 
as documentation? I noticed you don't use rst notation.
 
> Most of the stuff is based on a new object called "slice" (called
> "subdev" originally in Yuval's patchsets sent some while ago).
> 
> The text describes how things should behave and provides a draft
> of user facing console input/outputs. I think it is important to clear
> that up before we go in and implement the devlink core and
> driver pieces.
> 
> I would like to ask you to read this and comment. Especially, I would
> like to ask vendors if what is described fits the needs of your
> NIC/e-switch.
> 
> Please note that something is already implemented, but most of this
> isn't (see "what needs to be implemented" section).
> 
> 
> 
> 
> ==================================================================
> ||                                                              ||
> ||            Overall illustration of example setup             ||
> ||                                                              ||
> ==================================================================
> 
> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
> Host B may be one of the hosts which gets PF. Also, you might omit
> the Host B and just see Host A like an ordinary nic in a host.

Could you enumerate the use scenarios for the SmartNIC?

Is SmartNIC always "in-line", i.e. separating the host from the network?

Do we need a distinction based on whether the SmartNIC controls Host's
eswitch vs just the Host in its entirety (i.e. VF switching vs bare
metal)?

I find it really useful to be able to list use cases, and constraints
first. Then go onto the design.

> Note that the PF is merged with physical port representor.
> That is due to simpler and flawless transition from legacy mode and back.
> The devlink_ports and netdevs for physical ports are staying during
> the transition.

When users put an interface under bridge or a bond they have to move 
IP addresses etc. onto the bond. Changing mode to "switchdev" is a more
destructive operation and there should be no expectation that
configuration survives.

The merging of the PF with the physical port representor is flawed.

People push Qdisc offloads into devlink because of design shortcuts
like this.

>                         +-----------+
>                         |phys port 2+-----------------------------------+
>                         +-----------+                                   |
>                         +-----------+                                   |
>                         |phys port 1+---------------------------------+ |
>                         +-----------+                                 | |
>                                                                       | |
> +------------------------------------------------------------------+  | |
> |  devlink instance for the whole ASIC                   HOST A    |  | |
> |                                                                  |  | |
> |  pci/0000:06:00.0  (devlink dev)                                 |  | |
> |  +->health reporters, params, info, sb, dpipe,                   |  | |
> |  |  resource, traps                                              |  | |
> |  |                                                               |  | |
> |  +-+port_pci/0000:06:00.0/0+-----------------------+-------------|--+ |
> |  | |  flavour physical pfnum 0  (phys port and pf) ^             |    |

Please no.

> |  | |  netdev enp6s0f0np1                           |             |    |
> |  | +->health reporters, params                     |             |    |
> |  | |                                               |             |    |
> |  | +->slice_pci/0000:06:00.0/0+--------------------+             |    |
> |  |       flavour physical                                        |    |
> |  |                                                               |    |
> |  +-+port_pci/0000:06:00.0/1+-----------------------+-------------|----+
> |  | |  flavour physical pfnum 1  (phys port and pf) |             |
> |  | |  netdev enp6s0f0np2                           |             |
> |  | +->health reporters, params                     |             |
> |  | |                                               |             |
> |  | +->slice_pci/0000:06:00.0/1+--------------------+             |
> |  |       flavour physical                                        |
> |  |                                                               |
> |  +-+-+port_pci/0000:06:00.0/2+-----------+-------------------+   |
> |  | | |  flavour pcipf pfnum 2            ^                   |   |
> |  | | |  netdev enp6s0f0pf2               |                   |   |
> |  | + +->params                           |                   |   |
> |  | |                                     |                   |   |
> |  | +->slice_pci/0000:06:00.0/2+----------+                   |   |
> |  |       flavour pcipf                                       |   |
> |  |                                                           |   |
> |  +-+-+port_pci/0000:06:00.0/3+-----------+----------------+  |   |
> |  | | |  flavour pcivf pfnum 2 vfnum 0    ^                |  |   |
> |  | | |  netdev enp6s0pf2vf0              |                |  |   |
> |  | | +->params                           |                |  |   |
> |  | |                                     |                |  |   |
> |  | +-+slice_pci/0000:06:00.0/3+----------+                |  |   |
> |  |   |   flavour pcivf                                    |  |   |
> |  |   +->rate (qos), mpgroup, mac                          |  |   |
> |  |                                                        |  |   |
> |  +-+-+port_pci/0000:06:00.0/4+-----------+-------------+  |  |   |
> |  | | |  flavour pcivf pfnum 0 vfnum 0    ^             |  |  |   |

So PF 0 is both on the SmartNIC where it is physical and on the Hosts?
Is this just an error in the diagram?

> |  | | |  netdev enp6s0pf0vf0              |             |  |  |   |
> |  | | +->params                           |             |  |  |   |
> |  | |                                     |             |  |  |   |
> |  | +-+slice_pci/0000:06:00.0/4+----------+             |  |  |   |
> |  |   |   flavour pcivf                                 |  |  |   |
> |  |   +->rate (qos), mpgroup, mac                       |  |  |   |
> |  |                                                     |  |  |   |
> |  +-+-+port_pci/0000:06:00.0/5+-----------+----------+  |  |  |   |
> |  | | |  flavour pcisf pfnum 0 sfnum 1    ^          |  |  |  |   |
> |  | | |  netdev enp6s0pf0sf1              |          |  |  |  |   |
> |  | | +->params                           |          |  |  |  |   |
> |  | |                                     |          |  |  |  |   |
> |  | +-+slice_pci/0000:06:00.0/5+----------+          |  |  |  |   |
> |  |   |   flavour pcisf                              |  |  |  |   |
> |  |   +->rate (qos), mpgroup, mac                    |  |  |  |   |
> |  |                                                  |  |  |  |   |
> |  +-+slice_pci/0000:06:00.0/6+--------------------+  |  |  |  |   |
> |        flavour pcivf pfnum 0 vfnum 1             |  |  |  |  |   |
> |            (non-ethernet (IB, NVE)               |  |  |  |  |   |
> |                                                  |  |  |  |  |   |
> +------------------------------------------------------------------+
>                                                    |  |  |  |  |
>                                                    |  |  |  |  |
>                                                    |  |  |  |  |
> +----------------------------------------------+   |  |  |  |  |
> |  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
> |                                              |   |  |  |  |  |
> |  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
> |  +->health reporters, info                   |   |  |  |  |  |
> |  |                                           |   |  |  |  |  |
> |  +-+port_pci/0000:10:00.0/1+---------------------------------+
> |    |  flavour virtual                        |   |  |  |  |
> |    |  netdev enp16s0f0                       |   |  |  |  |
> |    +->health reporters                       |   |  |  |  |
> |                                              |   |  |  |  |
> +----------------------------------------------+   |  |  |  |
>                                                    |  |  |  |
> +----------------------------------------------+   |  |  |  |
> |  devlink instance VF (other host)    HOST B  |   |  |  |  |
> |                                              |   |  |  |  |
> |  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
> |  +->health reporters, info                   |   |  |  |  |
> |  |                                           |   |  |  |  |
> |  +-+port_pci/0000:10:00.1/1+------------------------------+
> |    |  flavour virtual                        |   |  |  |
> |    |  netdev enp16s0f0v0                     |   |  |  |
> |    +->health reporters                       |   |  |  |
> |                                              |   |  |  |
> +----------------------------------------------+   |  |  |
>                                                    |  |  |
> +----------------------------------------------+   |  |  |
> |  devlink instance VF                 HOST A  |   |  |  |
> |                                              |   |  |  |
> |  pci/0000:06:00.1  (devlink dev)             |   |  |  |
> |  +->health reporters, info                   |   |  |  |
> |  |                                           |   |  |  |
> |  +-+port_pci/0000:06:00.1/1+---------------------------+
> |    |  flavour virtual                        |   |  |
> |    |  netdev enp6s0f0v0                      |   |  |
> |    +->health reporters                       |   |  |
> |                                              |   |  |
> +----------------------------------------------+   |  |
>                                                    |  |
> +----------------------------------------------+   |  |
> |  devlink instance SF                 HOST A  |   |  |
> |                                              |   |  |
> |  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
> |  +->health reporters, info                   |   |  |
> |  |                                           |   |  |
> |  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
> |    |  flavour virtual                        |   |
> |    |  netdev enp6s0f0s1                      |   |
> |    +->health reporters                       |   |
> |                                              |   |
> +----------------------------------------------+   |
>                                                    |
> +----------------------------------------------+   |
> |  devlink instance VF                 HOST A  |   |
> |                                              |   |
> |  pci/0000:06:00.2  (devlink dev)+----------------+
> |  +->health reporters, info                   |
> |                                              |
> +----------------------------------------------+
> 
> 
> 
> 
> ==================================================================
> ||                                                              ||
> ||                 what needs to be implemented                 ||
> ||                                                              ||
> ==================================================================
> 
> 1) physical port "pfnum". When PF and physical port representor
>    are merged, the instance of devlink_port representing the physical port
>    and PF needs to have "pfnum" attribute to be in sync
>    with other PF port representors.

See above.

> 2) per-port health reporters are not implemented yet.

Which health reports are visible on a SmartNIC port? 

The Host ones or the SmartNIC ones?

I think queue reporters should be per-queue, see below.

> 3) devlink_port instance in PF/VF/SF flavour "virtual". In a PF/VF/SF devlink
>    instance (in a VM for example), it would make sense to have a devlink_port
>    instance. At least to carry a link to the netdevice name (otherwise we have
>    no easy way to find out which devlink instance and netdevice belong to each
>    other). I was thinking about the flavour name; we have to distinguish it
>    from the eswitch devlink port flavours "pcipf, pcivf, pcisf".

Virtual is the flavor for the VF port, IIUC, so what's left to name?
Do you mean pick a phys_port_name format?

>    This was recently implemented by Parav:
> commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
> Merge: c04d102ba56e 162add8cbae4
> Author: David S. Miller <davem@davemloft.net>
> Date:   Tue Mar 3 15:40:40 2020 -0800
> 
>     Merge branch 'devlink-virtual-port'
> 
>    What is missing is the "virtual" flavour for nested PF.
> 
> 4) slice is not implemented yet. This is the original "vdev/subdev" concept.
>    See below section "Slice user cmdline API draft".
> 
> 5) SFs are not implemented.
>    See below section "SF (subfunction) user cmdline API draft".
> 
> 6) rate for slice are not implemented yet.
>    See below section "Slice rate user cmdline API draft".
> 
> 7) mpgroup for slice is not implemented yet.
>    See below section "Slice mpgroup user cmdline API draft".
> 
> 8) VF manual creation using devlink is not implemented yet.
>    See below section "VF manual creation and activation user cmdline API draft".
>  
> 9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
>    merged e-switch.
> 
> 
> 
> ==================================================================
> ||                                                              ||
> ||                  Issues/open questions                       ||
> ||                                                              ||
> ==================================================================
> 
> 1) "pfnum" has to be per-asic(/devlink instance), not per-host.
>    That means that in smartNIC scenario, we cannot have "pfnum 0"
>    for smartNIC and "pfnum 0" for host as well.

Right, exactly, NFP already does that.

> 2) Q: for TX, RX queues reporters, should it be bound to devlink_port?
>    For which flavours this might make sense?
>    Most probably for flavours "physical"/"virtual".
>    How about the reporters in VF/SF?

I think with the work Magnus is doing we should have queues as first
class citizens to be able to allocate them to ports.

Would this mean we can hang reporters off queues?

> 3) How the management of nested switch is handled. The PFs created dynamically
>    or the ones in hosts in smartnic scenario may themselves be each a manager
>    of nested e-switch. How to toggle this capability?
>    During creation by a cmdline option?
>    During lifetime in case the PF does not have any children (VFs/SFs)?

Maybe the grouping of functions into devlink instances would help? 
SmartNIC could control if the host can perform switching between
functions by either putting them in the same Host side devlink 
instance or not.

> ==================================================================
> ||                                                              ||
> ||                Slice user cmdline API draft                  ||
> ||                                                              ||
> ==================================================================
> 
> Note that some of the "devlink port" attributes may be forgotten or misordered.
> 
> Slices and ports are created together by device driver. The driver defines
> the relationships during creation.
> 
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
> 
> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
> indicate the slice-port pairs.
> 
> Also, there is a fixed "state" attribute with value "active". This is by
> default as the VFs are always created active. In future, it is planned
> to implement manual VF creation and activation, similar to what is below
> described for SFs.
> 
> Note that the non-ethernet slice (the last one) does not have any
> related port. It can be for example NVE or IB. But since
> the "hw_addr" attribute is also omitted, it isn't IB.
> 
>  
> Now set a different MAC address for VF1 on PF0:
> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2

What are slices?

> ==================================================================
> ||                                                              ||
> ||          SF (subfunction) user cmdline API draft             ||
> ||                                                              ||
> ==================================================================
> 
> Note that some of the "devlink port" attributes may be forgotten or misordered.
> 
> Note that some of the "devlink slice" attributes in show commands
> are omitted on purpose.
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> 
> There is one VF on the NIC.
> 
> 
> Now create a subfunction on PF1 with sfnum 10; the index of the slice is
> going to be 100 and the hw_addr aa:bb:cc:aa:bb:cc.
> 
> $ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc

Why is the SF number specified by the user rather than allocated?

> The devlink kernel code calls down to the device driver (devlink op) and asks
> it to create a slice with particular attributes. The driver then instantiates
> the slice and port in the same way it is done for a VF:
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
> 
> Note that the SF slice is created but not active. That means the
> entities are created on devlink side, the e-switch port representor
> is created, but the SF device itself is not yet out there (same host
> or different, depends on where the parent PF is - in this case the same host).
> User might use e-switch port representor enp6s0pf1sf10 to do settings,
> putting it into bridge, adding TC rules, etc.
> It's like the cable is unplugged on the other side.

If it's just "cable unplugged" can't we just use the fact the
representor is down to indicate no traffic can flow?

> Now we activate (deploy) the SF:
> $ devlink slice set pci/0000:06:00.0/100 state active
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
> 
> Upon the activation, the device driver asks the device to instantiate
> the actual SF device on particular PF. Does not matter if that is
> on the same host or not.
> 
> On the other side, the PF driver instance gets the event from device
> that particular SF was activated. It's the cue to put the device on bus
> probe it and instantiate netdev and devlink instances for it.

Seems backwards. It's the PF that wants the new function, why can't it
just create it and either get an error from the other side or never get
link up?

> For every SF a device is created on virtbus with an ID assigned by the
> virtbus code. For example:
> /sys/bus/virtbus/devices/mlx5_sf.1
> 
> $ cat /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
> 10
> 
> /sys/bus/virtbus/devices/mlx5_sf.1 is a symlink to:
> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_sf.1
> 
> New devlink instance is named using alias:
> $ devlink dev show
> pci/0000:06:00.0%sf10
> 
> $ devlink port show
> pci/0000:06:00.0%sf10/0: flavour virtual type eth netdev enp6s0f0s10
> 
> You see that udev used the sysfs files and symlink to assemble the netdev name.
> 
> Note that this kind of aliasing is not implemented. It needs to be done in
> the devlink code in the kernel. During SF devlink instance creation, the
> parent PF device pointer and sfnum should be passed, from which the alias
> dev_name is assembled. This ensures persistent naming consistent in both
> the smartnic and host usecases.
> 
> If the user on the smartnic or host does not want the virtbus device to get
> probed automatically (for any reason), they can disable it by:
> 
> $ echo "0" > /sys/bus/virtbus/drivers_autoprobe
> 
> This is enabled by default.
> 
> 
> User can deactivate the slice by:
> 
> $ devlink slice set pci/0000:06:00.0/100 state inactive
> 
> This eventually leads to event delivered to PF driver, which is a
> cue to remove the SF device from virtbus and remove all related devlink
> and netdev instances.
> 
> The slice may be activated again.
> 
> Now for the teardown process, the user might remove the SF slice
> right away, without deactivation. However, it is possible to remove
> a deactivated SF too. To remove the SF, the user should do:
> 
> $ devlink slice del pci/0000:06:00.0/100
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66

The destruction also seems wrong way around. Could you explain why it
makes sense to create from SmartNIC side?

> ==================================================================
> ||                                                              ||
> ||   VF manual creation and activation user cmdline API draft   ||
> ||                                                              ||
> ==================================================================
> 
> To enter manual mode, the user has to turn off VF dummies creation:
> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
> $ devlink dev show
> pci/0000:06:00.0: vf_dummies disabled
> 
> It is "enabled" by default in order not to break existing users.
> 
> By setting the "vf_dummies" attribute to "disabled", the driver
> removes all dummy VFs. Only physical ports are present:
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> 
> Then the user is able to create them in a similar way as SFs:
> 
> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
> 
> The devlink kernel code calls down to the device driver (devlink op)
> and asks it to create a slice with the particular attributes. The driver
> then instantiates the slice and port:
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf0 slice 99
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
> 
> Now we activate (deploy) the VF:
> $ devlink slice set pci/0000:06:00.0/99 state active
> 
> $ devlink slice show
> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active
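The manual VF creation and activation flow above can be sketched as a small
model (illustrative Python, not kernel code; the class and attribute names
are assumptions for the sketch):

```python
# Model of the "devlink slice add" / "devlink slice set ... state active"
# flow: a manually created VF slice starts in admin state "inactive" and
# must be explicitly activated.

class Slice:
    def __init__(self, index, flavour, pfnum, vfnum=None, hw_addr=None):
        self.index = index
        self.flavour = flavour
        self.pfnum = pfnum
        self.vfnum = vfnum
        self.hw_addr = hw_addr
        self.state = "inactive"   # admin state right after "slice add"

class Dev:
    def __init__(self):
        self.slices = {}

    def slice_add(self, index, **kw):
        if index in self.slices:
            raise ValueError("slice index already in use")
        self.slices[index] = Slice(index, **kw)
        return self.slices[index]

    def slice_set_state(self, index, state):
        assert state in ("active", "inactive")
        self.slices[index].state = state

dev = Dev()
# $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 ...
s = dev.slice_add(99, flavour="pcivf", pfnum=1, vfnum=8,
                  hw_addr="aa:bb:cc:aa:bb:cc")
assert s.state == "inactive"
# $ devlink slice set pci/0000:06:00.0/99 state active
dev.slice_set_state(99, "active")
```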
> 
> ==================================================================
> ||                                                              ||
> ||                             PFs                              ||
> ||                                                              ||
> ==================================================================
> 
> There are 2 flavours of PFs:
> 1) Parent PF. This is coupled with an uplink port. The slice flavour is
>    therefore "physical", in sync with the flavour of the uplink port.
>    In case this Parent PF is actually a leg of an upstream embedded switch,
>    the slice flavour is "virtual" (same as the port flavour).
> 
>    $ devlink port show
>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> 
>    $ devlink slice show
>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> 
>    This slice is shown in both "switchdev" and "legacy" modes.
> 
>    If there is another parent PF, say "0000:06:00.1", that shares the
>    same embedded switch, aliasing is established for the devlink handles.
> 
>    The user can use devlink handles:
>    pci/0000:06:00.0
>    pci/0000:06:00.1
>    as equivalents, pointing to the same devlink instance.
> 
>    Parent PFs are the ones that may be in control of managing the
>    embedded switch, at any hierarchy level.
> 
> 2) Child PF. This is a leg of a PF put into the parent PF. It is
>    represented by a slice and a port (with a netdevice):
> 
>    $ devlink port show
>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
> 
>    $ devlink slice show
>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>    pci/0000:06:00.0/20: flavour pcipf pfnum 2 port 1 hw_addr aa:bb:cc:aa:bb:87 state active
> 
>    This is a typical smartnic scenario. You would see this list on
>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>    go to the host.
> 
>    Note that inside the host, the PF is represented again as "Parent PF"
>    and may be used to configure nested embedded switch.
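The PF aliasing described above (multiple devlink handles resolving to one
devlink instance for a shared embedded switch) can be sketched as follows;
this is an illustrative model only, and the function names are assumptions:

```python
# Two parent PFs sharing one embedded switch expose equivalent devlink
# handles pointing to the same instance.

class DevlinkInstance:
    def __init__(self, name):
        self.name = name

registry = {}

def register_aliases(instance, handles):
    # All given handles become equivalents for the same instance.
    for h in handles:
        registry[h] = instance

def lookup(handle):
    return registry[handle]

eswitch = DevlinkInstance("asic0")
register_aliases(eswitch, ["pci/0000:06:00.0", "pci/0000:06:00.1"])

# Both handles resolve to the same devlink instance:
assert lookup("pci/0000:06:00.0") is lookup("pci/0000:06:00.1")
```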

This parent/child PF I don't understand. Does it stem from some HW
limitations you have?

> ==================================================================
> ||                                                              ||
> ||               Slice operational state extension              ||
> ||                                                              ||
> ==================================================================
> 
> In addition to the "state" attribute, which serves the purpose
> of setting the admin state, an "opstate" attribute is added to
> reflect the operational state of the slice:
> 
> 
>     opstate                description
>     -----------            ------------
>     1. attached    State when the slice device is bound to the host
>                    driver. When the slice device is unbound from the
>                    host driver, it exits this state and enters the
>                    detaching state.
> 
>     2. detaching   State when the host has been notified to deactivate
>                    the slice device and the slice device may be
>                    undergoing detachment from the host driver. When the
>                    slice device is fully detached from the host driver,
>                    the slice exits this state and enters the detached
>                    state.
> 
>     3. detached    State entered when the driver is fully unbound from
>                    the slice device.
> 
> slice state machine:
> --------------------
>                                slice state set inactive
>                               ----<------------------<---
>                               | or  slice delete        |
>                               |                         |
>   __________              ____|_______              ____|_______
>  |          | slice add  |            |slice state |            |
>  |          |-------->---|            |------>-----|            |
>  | invalid  |            |  inactive  | set active |   active   |
>  |          | slice del  |            |            |            |
>  |__________|--<---------|____________|            |____________|
> 
> slice device operational state machine:
> ---------------------------------------
>   __________                ____________              ___________
>  |          | slice state  |            |driver bus  |           |
>  | invalid  |-------->-----|  detached  |------>-----| attached  |
>  |          | set active   |            | probe()    |           |
>  |__________|              |____________|            |___________|
>                                  |                        |
>                                  ^                    slice state
>                                  |                    set inactive
>                             successful detach             |
>                               or pf reset                 |
>                              ____|_______                 |
>                             |            | driver bus     |
>                  -----------| detaching  |---<-------------
>                  |          |            | remove()
>                  ^          |____________|
>                  |   timeout      |
>                  --<---------------
> 
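The operational-state machine in the diagram above can be written out as a
transition table; this is a toy model for illustration (the event names are
assumptions, only the states come from the diagram):

```python
# Slice operational-state transitions, following the diagram:
# invalid -> detached (slice state set active)
# detached -> attached (driver bus probe())
# attached -> detaching (slice state set inactive / driver bus remove())
# detaching -> detached (successful detach or pf reset)
# detaching -> detaching (timeout, retry)

OPSTATE_TRANSITIONS = {
    ("invalid",   "admin_active"):   "detached",
    ("detached",  "driver_probe"):   "attached",
    ("attached",  "admin_inactive"): "detaching",
    ("detaching", "detach_done"):    "detached",
    ("detaching", "timeout"):        "detaching",
}

def opstate_next(state, event):
    try:
        return OPSTATE_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}")

# Walk a full activate/deactivate cycle:
s = "invalid"
for ev in ("admin_active", "driver_probe", "admin_inactive", "detach_done"):
    s = opstate_next(s, ev)
assert s == "detached"
```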
> 
> 
> ==================================================================
> ||                                                              ||
> ||             Slice rate user cmdline API draft                ||
> ||                                                              ||
> ==================================================================
> 
> Note that some of the "devlink slice" attributes in show commands
> are omitted on purpose.
> 
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0
> pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
> pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1
> 
> The slice object is extended with a new rate object.
> 
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> 
> This shows the leaves created by default alongside the slice objects. No min or
> max tx rates were set, so their values are omitted.
> 
> 
> Now create new node rate object:
> 
> $ devlink slice rate add pci/0000:06:00.0/somerategroup
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node
> 
> A new node rate object was created - the last line.
> 
> 
> Now create another node object, this time with some attributes:
> 
> $ devlink slice rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> Another new node object was created - the last line. The object has min and max
> tx rates set, so they are displayed after the object type.
> 
> 
> Now set the min/max rate of the node named somerategroup using the rate object:
> 
> $ devlink slice rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> 
> Now set a leaf slice's rate using the rate object:
> 
> $ devlink slice rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> 
> Now set the parent node of the leaf slice with index 2 using the rate object:
> 
> $ devlink slice rate set pci/0000:06:00.0/2 parent somerategroup
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> 
> Now set the parent node of the leaf slice with index 1 using the rate object:
> 
> $ devlink slice rate set pci/0000:06:00.0/1 parent somerategroup
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf parent somerategroup
> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> 
> Now unset the parent node of the leaf slice with index 1 using the rate object:
> 
> $ devlink slice rate set pci/0000:06:00.0/1 noparent
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> 
> Now delete node object:
> 
> $ devlink slice rate del pci/0000:06:00.0/somerategroup
> 
> $ devlink slice rate
> pci/0000:06:00.0/1: type leaf
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> 
> The rate node object was removed and its only child, pci/0000:06:00.0/2, was
> automatically detached.
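The rate object semantics walked through above (per-slice leaves, user-created
nodes, node deletion auto-detaching children) can be summed up in a small
model; this is illustrative Python under assumed names, not the actual
devlink implementation:

```python
# Rate object tree: leaves exist per slice, nodes are user-created,
# and deleting a node detaches its children instead of deleting them.

class RateObj:
    def __init__(self, name, kind, min_tx=None, max_tx=None):
        self.name, self.kind = name, kind
        self.min_tx, self.max_tx = min_tx, max_tx
        self.parent = None

class RateTree:
    def __init__(self, leaf_names):
        # Leaves are created by default alongside the slice objects.
        self.objs = {n: RateObj(n, "leaf") for n in leaf_names}

    def add_node(self, name, min_tx=None, max_tx=None):
        self.objs[name] = RateObj(name, "node", min_tx, max_tx)

    def set_parent(self, name, parent):
        # Only node objects may be parents; "noparent" clears the link.
        assert parent is None or self.objs[parent].kind == "node"
        self.objs[name].parent = parent

    def del_node(self, name):
        assert self.objs[name].kind == "node"   # leaves cannot be deleted
        del self.objs[name]
        for o in self.objs.values():            # children auto-detach
            if o.parent == name:
                o.parent = None

t = RateTree(["1", "2", "3", "4"])
t.add_node("somerategroup")
t.set_parent("2", "somerategroup")
t.del_node("somerategroup")
assert t.objs["2"].parent is None
```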

Tomorrow we will support CoDel, ECN or any other queuing construct. 
How many APIs do we want to have to configure the same thing? :/

> ==================================================================
> ||                                                              ||
> ||            Slice ib grouping user cmdline API draft          ||
> ||                                                              ||
> ==================================================================
> 
> Note that some of the "devlink slice" attributes in show commands
> are omitted on purpose.
> 
> The reason for this IB grouping is that the VFs inside a virtual machine
> get information (via the device) about which two or more VF devices should
> be combined together to form one multi-port IB device. In the virtual
> machine, it is the driver's responsibility to set up the combined
> multi-port IB devices.
> 
> Consider following setup:
> 
> $ devlink slice show
> pci/0000:06:00.0/0: flavour physical pfnum 0
> pci/0000:06:00.0/1: flavour physical pfnum 1
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0
> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1
> 
> 
> Each VF/SF slice has an IB leaf object related to it:
> 
> $ devlink slice ib
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/5: type leaf
> 
> You see that by default, each slice is marked as a leaf.
> There is no grouping set.
> 
> 
> The user may add an IB group node by issuing the following command:
> 
> $ devlink slice ib add pci/0000:06:00.0/someibgroup1
> 
> $ devlink slice ib
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/5: type leaf
> pci/0000:06:00.0/someibgroup1: type node
> 
> A new IB node object was created - the last line.
> 
> 
> Now set the parent node of the leaf slice with index 2 using the ib object:
> 
> $ devlink slice ib set pci/0000:06:00.0/2 parent someibgroup1
> 
> $ devlink slice ib
> pci/0000:06:00.0/2: type leaf parent someibgroup1
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/5: type leaf
> pci/0000:06:00.0/someibgroup1: type node
> 
> 
> Now set the parent node of the leaf slice with index 5 using the ib object:
> 
> $ devlink slice ib set pci/0000:06:00.0/5 parent someibgroup1
> 
> $ devlink slice ib
> pci/0000:06:00.0/2: type leaf parent someibgroup1
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/5: type leaf parent someibgroup1
> pci/0000:06:00.0/someibgroup1: type node
> 
> Now you can see there are 2 leaf devices configured to have one parent.
> 
> 
> To remove the parent link, the user should issue the following command:
> 
> $ devlink slice ib set pci/0000:06:00.0/5 noparent
> 
> $ devlink slice ib
> pci/0000:06:00.0/2: type leaf parent someibgroup1
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/5: type leaf
> pci/0000:06:00.0/someibgroup1: type node
> 
> 
> Now delete node object:
> 
> $ devlink slice ib del pci/0000:06:00.0/someibgroup1
> $ devlink slice ib
> pci/0000:06:00.0/2: type leaf
> pci/0000:06:00.0/3: type leaf
> pci/0000:06:00.0/4: type leaf
> pci/0000:06:00.0/5: type leaf
> 
> The node object was removed and its only child, pci/0000:06:00.0/2, was
> automatically detached.
> 
> 
> It is not possible to delete leaves:
> 
> $ devlink slice ib del pci/0000:06:00.0/2
> devlink answers: Operation not supported
> 
> 
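The IB grouping rules above differ from the rate tree in one point worth
modeling: leaves are implicit per VF/SF slice and cannot be deleted. A small
illustrative sketch (assumed names and error mapping, not kernel code):

```python
# IB grouping: every VF/SF slice has an implicit leaf, group nodes are
# user-created, and deleting a leaf is rejected ("Operation not supported").
import errno

class IbObj:
    def __init__(self, kind):
        self.kind = kind
        self.parent = None

class IbGroups:
    def __init__(self, slice_indices):
        self.objs = {i: IbObj("leaf") for i in slice_indices}

    def add_node(self, name):
        self.objs[name] = IbObj("node")

    def set_parent(self, idx, parent):
        self.objs[idx].parent = parent

    def delete(self, name):
        if self.objs[name].kind == "leaf":
            # "devlink answers: Operation not supported"
            raise OSError(errno.EOPNOTSUPP, "Operation not supported")
        del self.objs[name]
        for o in self.objs.values():    # children auto-detach
            if o.parent == name:
                o.parent = None

g = IbGroups([2, 3, 4, 5])
g.add_node("someibgroup1")
g.set_parent(2, "someibgroup1")
g.set_parent(5, "someibgroup1")
try:
    g.delete(2)                         # leaves cannot be deleted
    raise AssertionError("leaf delete should have failed")
except OSError as e:
    assert e.errno == errno.EOPNOTSUPP
```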
> 
> ==================================================================
> ||                                                              ||
> ||            Dynamic PFs user cmdline API draft                ||
> ||                                                              ||
> ==================================================================
> 
> User might want to create another PF, similar as VF.
> TODO



* Re: [RFC] current devlink extension plan for NICs
  2020-03-20  3:32 ` Jakub Kicinski
@ 2020-03-20  7:35   ` Jiri Pirko
  2020-03-20 21:25     ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Jiri Pirko @ 2020-03-20  7:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta

Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:
>On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:
>> Hi all.
>> 
>> First, I would like to apologize for very long email. But I think it
>> would be beneficial to the see the whole picture, where we are going.
>> 
>> Currently in Mellanox we are working on several features with need of
>> extension of the current devlink infrastructure. I took a stab at
>> putting it all together in a single txt file, inlined below.
>
>Does it make sense to transcribe this (modulo discussion) 
>as documentation? I noticed you don't use rst notation.

That is the plan. I want to convert it, in parts, to the documentation.


> 
>> Most of the stuff is based on a new object called "slice" (called
>> "subdev" originally in Yuval's patchsets sent some while ago).
>> 
>> The text describes how things should behave and provides a draft
>> of user facing console input/outputs. I think it is important to clear
>> that up before we go in and implement the devlink core and
>> driver pieces.
>> 
>> I would like to ask you to read this and comment. Especially, I would
>> like to ask vendors if what is described fits the needs of your
>> NIC/e-switch.
>> 
>> Please note that something is already implemented, but most of this
>> isn't (see "what needs to be implemented" section).
>> 
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||            Overall illustration of example setup             ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
>> Host B may be one of the hosts which gets PF. Also, you might omit
>> the Host B and just see Host A like an ordinary nic in a host.
>
>Could you enumerate the use scenarios for the SmartNIC?
>
>Is SmartNIC always "in-line", i.e. separating the host from the network?

As far as I know, it is. The host is always given a PF which is a leg of
the eswitch managed on the SmartNIC.


>
>Do we need a distinction based on whether the SmartNIC controls Host's
>eswitch vs just the Host in its entirety (i.e. VF switching vs bare
>metal)?

I have this described in the "PFs" section. Basically, we need to have a
toggle to say "the host is managing its own nested eswitch".


>
>I find it really useful to be able to list use cases, and constraints
>first. Then go onto the design.
>
>> Note that the PF is merged with physical port representor.
>> That is due to simpler and flawless transition from legacy mode and back.
>> The devlink_ports and netdevs for physical ports are staying during
>> the transition.
>
>When users put an interface under bridge or a bond they have to move 
>IP addresses etc. onto the bond. Changing mode to "switchdev" is a more
>destructive operation and there should be no expectation that
>configuration survives.

Yeah, I was saying the same thing when our arch came up with this, but
I now think it is just fine. It is the driver's responsibility to do the
shift. The entities representing the uplink port - netdevs and
devlink_port instances - can easily stay during the transition. The
transition only applies to the eswitch and VF entities.


>
>The merging of the PF with the physical port representor is flawed.

Why?


>
>People push Qdisc offloads into devlink because of design shortcuts
>like this.

Could you please explain how it is related to "Qdisc offloads"?


>
>>                         +-----------+
>>                         |phys port 2+-----------------------------------+
>>                         +-----------+                                   |
>>                         +-----------+                                   |
>>                         |phys port 1+---------------------------------+ |
>>                         +-----------+                                 | |
>>                                                                       | |
>> +------------------------------------------------------------------+  | |
>> |  devlink instance for the whole ASIC                   HOST A    |  | |
>> |                                                                  |  | |
>> |  pci/0000:06:00.0  (devlink dev)                                 |  | |
>> |  +->health reporters, params, info, sb, dpipe,                   |  | |
>> |  |  resource, traps                                              |  | |
>> |  |                                                               |  | |
>> |  +-+port_pci/0000:06:00.0/0+-----------------------+-------------|--+ |
>> |  | |  flavour physical pfnum 0  (phys port and pf) ^             |    |
>
>Please no.

What exactly "no"?


>
>> |  | |  netdev enp6s0f0np1                           |             |    |
>> |  | +->health reporters, params                     |             |    |
>> |  | |                                               |             |    |
>> |  | +->slice_pci/0000:06:00.0/0+--------------------+             |    |
>> |  |       flavour physical                                        |    |
>> |  |                                                               |    |
>> |  +-+port_pci/0000:06:00.0/1+-----------------------+-------------|----+
>> |  | |  flavour physical pfnum 1  (phys port and pf) |             |
>> |  | |  netdev enp6s0f0np2                           |             |
>> |  | +->health reporters, params                     |             |
>> |  | |                                               |             |
>> |  | +->slice_pci/0000:06:00.0/1+--------------------+             |
>> |  |       flavour physical                                        |
>> |  |                                                               |
>> |  +-+-+port_pci/0000:06:00.0/2+-----------+-------------------+   |
>> |  | | |  flavour pcipf pfnum 2            ^                   |   |
>> |  | | |  netdev enp6s0f0pf2               |                   |   |
>> |  | + +->params                           |                   |   |
>> |  | |                                     |                   |   |
>> |  | +->slice_pci/0000:06:00.0/2+----------+                   |   |
>> |  |       flavour pcipf                                       |   |
>> |  |                                                           |   |
>> |  +-+-+port_pci/0000:06:00.0/3+-----------+----------------+  |   |
>> |  | | |  flavour pcivf pfnum 2 vfnum 0    ^                |  |   |
>> |  | | |  netdev enp6s0pf2vf0              |                |  |   |
>> |  | | +->params                           |                |  |   |
>> |  | |                                     |                |  |   |
>> |  | +-+slice_pci/0000:06:00.0/3+----------+                |  |   |
>> |  |   |   flavour pcivf                                    |  |   |
>> |  |   +->rate (qos), mpgroup, mac                          |  |   |
>> |  |                                                        |  |   |
>> |  +-+-+port_pci/0000:06:00.0/4+-----------+-------------+  |  |   |
>> |  | | |  flavour pcivf pfnum 0 vfnum 0    ^             |  |  |   |
>
>So PF 0 is both on the SmartNIC where it is physical and on the Hosts?
>Is this just error in the diagram?

I think it is an error in the reading. This is the VF representation
of VF pci/0000:06:00.1, on the same host A, which is the SmartNIC host.


>
>> |  | | |  netdev enp6s0pf0vf0              |             |  |  |   |
>> |  | | +->params                           |             |  |  |   |
>> |  | |                                     |             |  |  |   |
>> |  | +-+slice_pci/0000:06:00.0/4+----------+             |  |  |   |
>> |  |   |   flavour pcivf                                 |  |  |   |
>> |  |   +->rate (qos), mpgroup, mac                       |  |  |   |
>> |  |                                                     |  |  |   |
>> |  +-+-+port_pci/0000:06:00.0/5+-----------+----------+  |  |  |   |
>> |  | | |  flavour pcisf pfnum 0 sfnum 1    ^          |  |  |  |   |
>> |  | | |  netdev enp6s0pf0sf1              |          |  |  |  |   |
>> |  | | +->params                           |          |  |  |  |   |
>> |  | |                                     |          |  |  |  |   |
>> |  | +-+slice_pci/0000:06:00.0/5+----------+          |  |  |  |   |
>> |  |   |   flavour pcisf                              |  |  |  |   |
>> |  |   +->rate (qos), mpgroup, mac                    |  |  |  |   |
>> |  |                                                  |  |  |  |   |
>> |  +-+slice_pci/0000:06:00.0/6+--------------------+  |  |  |  |   |
>> |        flavour pcivf pfnum 0 vfnum 1             |  |  |  |  |   |
>> |            (non-ethernet (IB, NVE)               |  |  |  |  |   |
>> |                                                  |  |  |  |  |   |
>> +------------------------------------------------------------------+
>>                                                    |  |  |  |  |
>>                                                    |  |  |  |  |
>>                                                    |  |  |  |  |
>> +----------------------------------------------+   |  |  |  |  |
>> |  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
>> |                                              |   |  |  |  |  |
>> |  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
>> |  +->health reporters, info                   |   |  |  |  |  |
>> |  |                                           |   |  |  |  |  |
>> |  +-+port_pci/0000:10:00.0/1+---------------------------------+
>> |    |  flavour virtual                        |   |  |  |  |
>> |    |  netdev enp16s0f0                       |   |  |  |  |
>> |    +->health reporters                       |   |  |  |  |
>> |                                              |   |  |  |  |
>> +----------------------------------------------+   |  |  |  |
>>                                                    |  |  |  |
>> +----------------------------------------------+   |  |  |  |
>> |  devlink instance VF (other host)    HOST B  |   |  |  |  |
>> |                                              |   |  |  |  |
>> |  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
>> |  +->health reporters, info                   |   |  |  |  |
>> |  |                                           |   |  |  |  |
>> |  +-+port_pci/0000:10:00.1/1+------------------------------+
>> |    |  flavour virtual                        |   |  |  |
>> |    |  netdev enp16s0f0v0                     |   |  |  |
>> |    +->health reporters                       |   |  |  |
>> |                                              |   |  |  |
>> +----------------------------------------------+   |  |  |
>>                                                    |  |  |
>> +----------------------------------------------+   |  |  |
>> |  devlink instance VF                 HOST A  |   |  |  |
>> |                                              |   |  |  |
>> |  pci/0000:06:00.1  (devlink dev)             |   |  |  |
>> |  +->health reporters, info                   |   |  |  |
>> |  |                                           |   |  |  |
>> |  +-+port_pci/0000:06:00.1/1+---------------------------+
>> |    |  flavour virtual                        |   |  |
>> |    |  netdev enp6s0f0v0                      |   |  |
>> |    +->health reporters                       |   |  |
>> |                                              |   |  |
>> +----------------------------------------------+   |  |
>>                                                    |  |
>> +----------------------------------------------+   |  |
>> |  devlink instance SF                 HOST A  |   |  |
>> |                                              |   |  |
>> |  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
>> |  +->health reporters, info                   |   |  |
>> |  |                                           |   |  |
>> |  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
>> |    |  flavour virtual                        |   |
>> |    |  netdev enp6s0f0s1                      |   |
>> |    +->health reporters                       |   |
>> |                                              |   |
>> +----------------------------------------------+   |
>>                                                    |
>> +----------------------------------------------+   |
>> |  devlink instance VF                 HOST A  |   |
>> |                                              |   |
>> |  pci/0000:06:00.2  (devlink dev)+----------------+
>> |  +->health reporters, info                   |
>> |                                              |
>> +----------------------------------------------+
>> 
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||                 what needs to be implemented                 ||
>> ||                                                              ||
>> ==================================================================
>> 
>> 1) physical port "pfnum". When PF and physical port representor
>>    are merged, the instance of devlink_port representing the physical port
>>    and PF needs to have "pfnum" attribute to be in sync
>>    with other PF port representors.
>
>See above.
>
>> 2) per-port health reporters are not implemented yet.
>
>Which health reports are visible on a SmartNIC port? 

I think there is a use case for SmartNIC uplink/PF port health reporters.
Those are the ones we currently have for TX/RX queues on the devlink
instance (that was a mistake).

>
>The Host ones or the SmartNIC ones?

In the host, I think there is a use case for VF/SF devlink_port health
reporters - also for TX/RX queues.


>
>I think queue reporters should be per-queue, see below.

Depends. There are reporters, like "fw", that are per-asic.


>
>> 3) devlink_port instance in PF/VF/SF flavour "virtual". In PF/VF/SF devlink
>>    instance (in VM for example), there would make sense to have devlink_port
>>    instance. At least to carry link to netdevice name (otherwise we have
>>    no easy way to find out devlink instance and netdevice belong to each other).
>>    I was thinking about flavour name, we have to distinguish from eswitch
>>    devlink port flavours "pcipf, pcivf, pcisf".
>
>Virtual is the flavor for the VF port, IIUC, so what's left to name?
>Do you mean pick a phys_port_name format?

No. "virtual" is devlink_port flavour in the host in the VF devlink
instance. See port_pci/0000:10:00.0/1 above for example.


>
>>    This was recently implemented by Parav:
>> commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
>> Merge: c04d102ba56e 162add8cbae4
>> Author: David S. Miller <davem@davemloft.net>
>> Date:   Tue Mar 3 15:40:40 2020 -0800
>> 
>>     Merge branch 'devlink-virtual-port'
>> 
>>    What is missing is the "virtual" flavour for nested PF.
>> 
>> 4) slice is not implemented yet. This is the original "vdev/subdev" concept.
>>    See below section "Slice user cmdline API draft".
>> 
>> 5) SFs are not implemented.
>>    See below section "SF (subfunction) user cmdline API draft".
>> 
>> 6) rate for slice are not implemented yet.
>>    See below section "Slice rate user cmdline API draft".
>> 
>> 7) mpgroup for slice is not implemented yet.
>>    See below section "Slice mpgroup user cmdline API draft".
>> 
>> 8) VF manual creation using devlink is not implemented yet.
>>    See below section "VF manual creation and activation user cmdline API draft".
>>  
>> 9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
>>    merged e-switch.
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||                  Issues/open questions                       ||
>> ||                                                              ||
>> ==================================================================
>> 
>> 1) "pfnum" has to be per-asic(/devlink instance), not per-host.
>>    That means that in smartNIC scenario, we cannot have "pfnum 0"
>>    for smartNIC and "pfnum 0" for host as well.
>
>Right, exactly, NFP already does that.
>
>> 2) Q: for TX, RX queues reporters, should it be bound to devlink_port?
>>    For which flavours this might make sense?
>>    Most probably for flavours "physical"/"virtual".
>>    How about the reporters in VF/SF?
>
>I think with the work Magnus is doing we should have queues as first

Can you point me to the "work"?


>class citizens to be able to allocate them to ports.
>
>Would this mean we can hang reporters off queues?

Yes, if we have a "queue object", the per-queue reporter would make sense.


>
>> 3) How the management of nested switch is handled. The PFs created dynamically
>>    or the ones in hosts in smartnic scenario may themselves be each a manager
>>    of nested e-switch. How to toggle this capability?
>>    During creation by a cmdline option?
>>    During lifetime in case the PF does not have any children (VFs/SFs)?
>
>Maybe the grouping of functions into devlink instances would help? 
>SmartNIC could control if the host can perform switching between
>functions by either putting them in the same Host side devlink 
>instance or not.

I'm not sure I follow. There are a number of PFs created "on probe".
Those are fixed and the driver knows where to put them.
The comment was about a possible "dynamic PF" created by the user in the
same way he creates an SF, by devlink cmdline.

Now the PF itself can have a "nested eswitch" to manage. The "parent
eswitch" where the PF was created would only see one leg to the "nested
eswitch".

This "nested eswitch management" might or might not be required. Depends
on the use case. The question was how to configure whether I as a user
want this or not.


>
>> ==================================================================
>> ||                                                              ||
>> ||                Slice user cmdline API draft                  ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that some of the "devlink port" attributes may be forgotten or misordered.
>> 
>> Slices and ports are created together by device driver. The driver defines
>> the relationships during creation.
>> 
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
>> 
>> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
>> indicate the slice-port pairs.
>> 
>> Also, there is a fixed "state" attribute with value "active". This is the
>> default, as the VFs are always created active. In future, it is planned
>> to implement manual VF creation and activation, similar to what is below
>> described for SFs.
>> 
>> Note that the non-ethernet slice (the last one) does not have any
>> related port. It can be for example NVE or IB. But since
>> the "hw_addr" attribute is also omitted, it isn't IB.
>> 
>>  
>> Now set a different MAC address for VF1 on PF0:
>> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
>
>What are slices?

Slice is basically a piece of the ASIC: pf/vf/sf. Slices serve for
configuration of the "other side of the wire", like the MAC. A hypervisor
admin can use the slice to set the MAC address of a VF which is in a
virtual machine. Basically this should be a replacement for the "ip vf"
command.
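Just to make that mapping concrete, here is a toy Python sketch (not driver code; the table and helper are made up for illustration, mirroring the "devlink slice show" output quoted above):

```python
# Slice table mirroring the quoted "devlink slice show" output:
# (slice index, flavour, pfnum, vfnum)
SLICES = [
    (2, "pcivf", 0, 0),
    (3, "pcivf", 0, 1),
    (4, "pcivf", 1, 0),
    (5, "pcivf", 1, 1),
]

def slice_set_mac_cmd(dev, pfnum, vfnum, mac):
    """Return the 'devlink slice set' equivalent of the legacy
    'ip link set <pf-netdev> vf <vfnum> mac <mac>' command."""
    for index, flavour, pf, vf in SLICES:
        if flavour == "pcivf" and pf == pfnum and vf == vfnum:
            return f"devlink slice set {dev}/{index} hw_addr {mac}"
    raise LookupError("no such VF slice")

print(slice_set_mac_cmd("pci/0000:06:00.0", 0, 1, "aa:bb:cc:dd:ee:ff"))
# -> devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
```

The point is that the hypervisor admin addresses the VF via the eswitch-side slice handle instead of the PF netdev.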


>
>> ==================================================================
>> ||                                                              ||
>> ||          SF (subfunction) user cmdline API draft             ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that some of the "devlink port" attributes may be forgotten or misordered.
>> 
>> Note that some of the "devlink slice" attributes in show commands
>> are omitted on purpose.
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>> 
>> There is one VF on the NIC.
>> 
>> 
>> Now create a subfunction on PF1; the index of the slice is going to be 100
>> and the hw_address aa:bb:cc:aa:bb:cc.
>> 
>> $ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc
>
>Why is the SF number specified by the user rather than allocated?

Because it is shown in the representor netdevice name, and you need to have
it predetermined: enp6s0pf1sf10
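To illustrate why (a toy sketch; the helper name is made up, but the resulting name matches the examples in this thread), the representor name is assembled from the PF netdev base name plus pfnum/sfnum, so it is only known ahead of time if the user picks the sfnum:

```python
def sf_repr_name(pf_netdev_base, pfnum, sfnum):
    """Sketch of udev-style SF representor naming: base name of the
    uplink/PF netdev plus the pf and sf numbers."""
    return f"{pf_netdev_base}pf{pfnum}sf{sfnum}"

print(sf_repr_name("enp6s0", 1, 10))  # -> enp6s0pf1sf10
```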


>
>> The devlink kernel code calls down to the device driver (devlink op) and asks
>> it to create a slice with particular attributes. The driver then instantiates
>> the slice and port in the same way it is done for a VF:
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
>> 
>> Note that the SF slice is created but not active. That means the
>> entities are created on devlink side, the e-switch port representor
>> is created, but the SF device itself is not yet out there (same host
>> or different, depends on where the parent PF is - in this case the same host).
>> User might use the e-switch port representor enp6s0pf1sf10 to do settings:
>> putting it into a bridge, adding TC rules, etc.
>> It's like the cable is unplugged on the other side.
>
>If it's just "cable unplugged" can't we just use the fact the
>representor is down to indicate no traffic can flow?

It is not "cable unplugged". This "state inactive/active" is admin
state. You as an eswitch admin say: "I'm done with configuring a slice
(MAC) and a representor (bridge, TC, etc.) for this particular SF and
I want the HOST to instantiate the SF instance (with the configured
MAC)."
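Put another way, a minimal toy model of the admin-state semantics described above (plain Python, not driver code; class and attribute names are made up):

```python
# Toy model of the slice admin state. "slice add" creates the slice
# inactive: the representor exists and can be configured, but the SF
# device is not instantiated on the host yet. Setting the state to
# "active" is the cue for the device to instantiate it.
class SliceSketch:
    def __init__(self):
        self.state = "inactive"      # as created by "devlink slice add"
        self.sf_instantiated = False

    def set_state(self, state):
        assert state in ("active", "inactive")
        self.state = state
        # the device instantiates/removes the SF on the host accordingly
        self.sf_instantiated = (state == "active")

s = SliceSketch()
assert not s.sf_instantiated     # configure MAC, bridge, TC rules here
s.set_state("active")            # "devlink slice set ... state active"
assert s.sf_instantiated
```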


>
>> Now we activate (deploy) the SF:
>> $ devlink slice set pci/0000:06:00.0/100 state active
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
>> 
>> Upon the activation, the device driver asks the device to instantiate
>> the actual SF device on a particular PF. It does not matter if that is
>> on the same host or not.
>> 
>> On the other side, the PF driver instance gets the event from device
>> that particular SF was activated. It's the cue to put the device on bus
>> probe it and instantiate netdev and devlink instances for it.
>
>Seems backwards. It's the PF that wants the new function, why can't it
>just create it and either get an error from the other side or never get
>link up?

We discussed that many times internally. I think it makes sense that
the SF is created by the same entity that manages the related eswitch
SF-representor. In other words, the "devlink slice" and related "devlink
port" object are under the same devlink instance.

If the PF in the host manages a nested eswitch, it can create the SF inside
and manage the nested eswitch SF-representor as you describe.

It is a matter of "nested eswitch manager on/off" configuration.

I think this is a clean model and it is known who has which
responsibilities.


>
>> For every SF a device is created on virtbus with an ID assigned by the
>> virtbus code. For example:
>> /sys/bus/virtbus/devices/mlx5_sf.1
>> 
>> $ cat /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
>> 10
>> 
>> /sys/bus/virtbus/devices/mlx5_sf.1 is a symlink to:
>> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_sf.1
>> 
>> New devlink instance is named using alias:
>> $ devlink dev show
>> pci/0000:06:00.0%sf10
>> 
>> $ devlink port show
>> pci/0000:06:00.0%sf10/0: flavour virtual type eth netdev enp6s0f0s10
>> 
>> You see that the udev used the sysfs files and symlink to assemble netdev name.
>> 
>> Note that this kind of aliasing is not implemented. It needs to be done in
>> the devlink code in kernel. During SF devlink instance creation, the parent
>> PF device pointer and sfnum should be passed, from which the alias dev_name
>> is assembled. This ensures persistent naming consistent in both the smartnic
>> and host usecase.
>> 
>> If the user on the smartnic or host does not want the virtbus device to get
>> probed automatically (for any reason), he can disable that by:
>> 
>> $ echo "0" > /sys/bus/virtbus/drivers_autoprobe
>> 
>> This is enabled by default.
>> 
>> 
>> User can deactivate the slice by:
>> 
>> $ devlink slice set pci/0000:06:00.0/100 state inactive
>> 
>> This eventually leads to an event delivered to the PF driver, which is a
>> cue to remove the SF device from virtbus and remove all related devlink
>> and netdev instances.
>> 
>> The slice may be activated again.
>> 
>> Now on the teardown process, the user might remove the SF slice
>> right away, without deactivation. However, it is possible to remove
>> a deactivated SF too. To remove the SF, the user should do:
>> 
>> $ devlink slice del pci/0000:06:00.0/100
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>
>The destruction also seems wrong way around. Could you explain why it
>makes sense to create from SmartNIC side?

See above.


>
>> ==================================================================
>> ||                                                              ||
>> ||   VF manual creation and activation user cmdline API draft   ||
>> ||                                                              ||
>> ==================================================================
>> 
>> To enter manual mode, the user has to turn off dummy VF creation:
>> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>> $ devlink dev show
>> pci/0000:06:00.0: vf_dummies disabled
>> 
>> It is "enabled" by default in order not to break existing users.
>> 
>> By setting the "vf_dummies" attribute to "disabled", the driver
>> removes all dummy VFs. Only physical ports are present:
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> 
>> Then the user is able to create them in a similar way as SFs:
>> 
>> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
>> 
>> The devlink kernel code calls down to the device driver (devlink op) and asks
>> it to create a slice with particular attributes. The driver then instantiates
>> the slice and port:
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf8 slice 99
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
>> 
>> Now we activate (deploy) the VF:
>> $ devlink slice set pci/0000:06:00.0/99 state active
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active
>> 
>> ==================================================================
>> ||                                                              ||
>> ||                             PFs                              ||
>> ||                                                              ||
>> ==================================================================
>> 
>> There are 2 flavours of PFs:
>> 1) Parent PF. That is coupled with the uplink port. The slice flavour is
>>    therefore "physical", to be in sync with the flavour of the uplink port.
>>    In case this Parent PF is actually a leg of an upstream embedded switch,
>>    the slice flavour is "virtual" (same as the port flavour).
>> 
>>    $ devlink port show
>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
>> 
>>    $ devlink slice show
>>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> 
>>    This slice is shown in both "switchdev" and "legacy" modes.
>> 
>>    If there is another parent PF, say "0000:06:00.1", that shares the
>>    same embedded switch, aliasing is established for the devlink handles.
>> 
>>    The user can use devlink handles:
>>    pci/0000:06:00.0
>>    pci/0000:06:00.1
>>    as equivalents, pointing to the same devlink instance.
>> 
>>    Parent PFs are the ones that may be in control of managing
>>    embedded switch, on any hierarchy level.
>> 
>> 2) Child PF. This is a leg of a PF put to the parent PF. It is
>>    represented by a slice, and a port (with a netdevice):
>> 
>>    $ devlink port show
>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
>>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
>> 
>>    $ devlink slice show
>>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>    pci/0000:06:00.0/20: flavour pcipf pfnum 2 port 1 hw_addr aa:bb:cc:aa:bb:87 state active  <<<<<<<<<<
>> 
>>    This is a typical smartnic scenario. You would see this list on
>>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
>>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>>    go to the host.
>> 
>>    Note that inside the host, the PF is represented again as "Parent PF"
>>    and may be used to configure nested embedded switch.
>
>This parent/child PF I don't understand. Does it stem from some HW
>limitations you have?

No limitation. It's just a name for 2 roles. I didn't know how else to
name it for documentation purposes. Perhaps you can help me.

The child can simply manage a "nested eswitch". The "parent eswitch"
would see one leg (pf representor) one way or another. Only in case the
"nested eswitch" is there would the child manage it - have separate
representors for vfs/sfs under its devlink instance.


>
>> ==================================================================
>> ||                                                              ||
>> ||               Slice operational state extension              ||
>> ||                                                              ||
>> ==================================================================
>> 
>> In addition to the "state" attribute that serves for the purpose
>> of setting the "admin state", there is "opstate" attribute added to
>> reflect the operational state of the slice:
>> 
>> 
>>     opstate                description
>>     -----------            ------------
>>     1. attached    State when slice device is bound to the host
>>                    driver. When the slice device is unbound from the
>>                    host driver, slice device exits this state and
>>                    enters detaching state.
>> 
>>     2. detaching   State when host is notified to deactivate the
>>                    slice device and slice device may be undergoing
>>                    detachment from host driver. When slice device is
>>                    fully detached from the host driver, slice exits
>>                    this state and enters detached state.
>> 
>>     3. detached    State entered when the driver is fully
>>                    unbound from the slice device.
>> 
>> slice state machine:
>> --------------------
>>                                slice state set inactive
>>                               ----<------------------<---
>>                               | or  slice delete        |
>>                               |                         |
>>   __________              ____|_______              ____|_______
>>  |          | slice add  |            |slice state |            |
>>  |          |-------->---|            |------>-----|            |
>>  | invalid  |            |  inactive  | set active |   active   |
>>  |          | slice del  |            |            |            |
>>  |__________|--<---------|____________|            |____________|
>> 
>> slice device operational state machine:
>> ---------------------------------------
>>   __________                ____________              ___________
>>  |          | slice state  |            |driver bus  |           |
>>  | invalid  |-------->-----|  detached  |------>-----| attached  |
>>  |          | set active   |            | probe()    |           |
>>  |__________|              |____________|            |___________|
>>                                  |                        |
>>                                  ^                    slice set
>>                                  |                    set inactive
>>                             successful detach             |
>>                               or pf reset                 |
>>                              ____|_______                 |
>>                             |            | driver bus     |
>>                  -----------| detaching  |---<-------------
>>                  |          |            | remove()
>>                  ^          |____________|
>>                  |   timeout      |
>>                  --<---------------
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||             Slice rate user cmdline API draft                ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that some of the "devlink slice" attributes in show commands
>> are omitted on purpose.
>> 
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum
>> pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
>> pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1
>> 
>> The slice object is extended with a new rate object.
>> 
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> 
>> This shows the leafs created by default alongside the slice objects. No min or
>> max tx rates were set, so their values are omitted.
>> 
>> 
>> Now create a new node rate object:
>> 
>> $ devlink slice rate add pci/0000:06:00.0/somerategroup
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node
>> 
>> New node rate object was created - the last line.
>> 
>> 
>> Now another new node object is created, this time with some attributes:
>> 
>> $ devlink slice rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node
>> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> 
>> Another new node object was created - the last line. The object has min and max
>> tx rates set, so they are displayed after the object type.
>> 
>> 
>> Now set node named somerategroup min/max rate using rate object:
>> 
>> $ devlink slice rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> 
>> 
>> Now set leaf slice rate using rate object:
>> 
>> $ devlink slice rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> 
>> 
>> Now set the parent node of the leaf slice with index 2 using the rate object:
>> 
>> $ devlink slice rate set pci/0000:06:00.0/2 parent somerategroup
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> 
>> 
>> Now set the parent node of the leaf slice with index 1 using the rate object:
>> 
>> $ devlink slice rate set pci/0000:06:00.0/1 parent somerategroup
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf parent somerategroup
>> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> 
>> 
>> Now unset the parent node of the leaf slice with index 1 using the rate object:
>> 
>> $ devlink slice rate set pci/0000:06:00.0/1 noparent
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> 
>> 
>> Now delete node object:
>> 
>> $ devlink slice rate del pci/0000:06:00.0/somerategroup
>> 
>> $ devlink slice rate
>> pci/0000:06:00.0/1: type leaf
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> 
>> Rate node object was removed and its only child pci/0000:06:00.0/2 automatically
>> detached.
>
>Tomorrow we will support CoDel, ECN or any other queuing construct. 
>How many APIs do we want to have to configure the same thing? :/

Right, I don't see another way. Please help me figure this out
differently. Note this is for configuring HW limits on the TX side of a VF.

We have only the devlink_port and related netdev as representor of the VF
eswitch port. We cannot use this netdev to configure a qdisc on RX.
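For illustration, here is a toy Python model (not kernel code; class names are made up) of the rate object hierarchy from the transcripts above: leaves exist per slice, user-created nodes group them, and deleting a node detaches its children:

```python
# Toy model of the proposed slice-rate hierarchy.
class RateObj:
    def __init__(self, name, typ, min_tx=None, max_tx=None):
        self.name, self.typ = name, typ
        self.min_tx, self.max_tx = min_tx, max_tx
        self.parent = None           # name of a node object, or None

class RateTable:
    def __init__(self, leaf_names):
        # leaves are created by default alongside the slice objects
        self.objs = {n: RateObj(n, "leaf") for n in leaf_names}

    def add_node(self, name, **kw):          # "devlink slice rate add"
        self.objs[name] = RateObj(name, "node", **kw)

    def set_parent(self, name, parent):      # "devlink slice rate set ... parent"
        assert parent is None or self.objs[parent].typ == "node"
        self.objs[name].parent = parent

    def del_node(self, name):                # "devlink slice rate del"
        assert self.objs[name].typ == "node"  # leaves cannot be deleted
        for obj in self.objs.values():
            if obj.parent == name:
                obj.parent = None             # children auto-detach
        del self.objs[name]

t = RateTable(["1", "2", "3", "4"])
t.add_node("somerategroup", min_tx=50, max_tx=5000)
t.set_parent("2", "somerategroup")
t.del_node("somerategroup")
assert t.objs["2"].parent is None   # child was automatically detached
```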


>
>> ==================================================================
>> ||                                                              ||
>> ||            Slice ib grouping user cmdline API draft          ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that some of the "devlink slice" attributes in show commands
>> are omitted on purpose.
>> 
>> The reason for this IB grouping is that the VFs inside a virtual machine
>> get information (via the device) about which two or more VF devices should
>> be combined together to form one multi-port IB device. In the virtual
>> machine it is the driver's responsibility to set up the combined
>> multi-port IB devices.
>> 
>> Consider following setup:
>> 
>> $ devlink slice show
>> pci/0000:06:00.0/0: flavour physical pfnum 0
>> pci/0000:06:00.0/1: flavour physical pfnum 1
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0
>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1
>> 
>> 
>> Each VF/SF slice has an IB leaf object related to it:
>> 
>> $ devlink slice ib
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/5: type leaf
>> 
>> You see that by default, each slice is marked as a leaf.
>> There is no grouping set.
>> 
>> 
>> User may add an ib group node by issuing the following command:
>> 
>> $ devlink slice ib add pci/0000:06:00.0/someibgroup1
>> 
>> $ devlink slice ib
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/5: type leaf
>> pci/0000:06:00.0/someibgroup1: type node
>> 
>> New ib node object was created - the last line.
>> 
>> 
>> Now set the parent node of the leaf slice with index 2 using the ib object:
>> 
>> $ devlink slice ib set pci/0000:06:00.0/2 parent someibgroup1
>> 
>> $ devlink slice ib
>> pci/0000:06:00.0/2: type leaf parent someibgroup1
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/5: type leaf
>> pci/0000:06:00.0/someibgroup1: type node
>> 
>> 
>> Now set the parent node of the leaf slice with index 5 using the ib object:
>> 
>> $ devlink slice ib set pci/0000:06:00.0/5 parent someibgroup1
>> 
>> $ devlink slice ib
>> pci/0000:06:00.0/2: type leaf parent someibgroup1
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/5: type leaf parent someibgroup1
>> pci/0000:06:00.0/someibgroup1: type node
>> 
>> Now you can see there are 2 leaf devices configured to have one parent.
>> 
>> 
>> To remove the parent link, the user should issue the following command:
>> 
>> $ devlink slice ib set pci/0000:06:00.0/5 noparent
>> 
>> $ devlink slice ib
>> pci/0000:06:00.0/2: type leaf parent someibgroup1
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/5: type leaf
>> pci/0000:06:00.0/someibgroup1: type node
>> 
>> 
>> Now delete node object:
>> 
>> $ devlink slice ib del pci/0000:06:00.0/someibgroup1
>> $ devlink slice ib
>> pci/0000:06:00.0/2: type leaf
>> pci/0000:06:00.0/3: type leaf
>> pci/0000:06:00.0/4: type leaf
>> pci/0000:06:00.0/5: type leaf
>> 
>> Node object was removed and its only child pci/0000:06:00.0/2 automatically
>> detached.
>> 
>> 
>> It is not possible to delete leafs:
>> 
>> $ devlink slice ib del pci/0000:06:00.0/2
>> devlink answers: Operation not supported
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||            Dynamic PFs user cmdline API draft                ||
>> ||                                                              ||
>> ==================================================================
>> 
>> User might want to create another PF, similar to a VF.
>> TODO
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-20  7:35   ` Jiri Pirko
@ 2020-03-20 21:25     ` Jakub Kicinski
  2020-03-21  9:07       ` Parav Pandit
                         ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-20 21:25 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
> Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:
> >On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:  
> >> ==================================================================
> >> ||                                                              ||
> >> ||            Overall illustration of example setup             ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
> >> Host B may be one of the hosts which gets PF. Also, you might omit
> >> the Host B and just see Host A like an ordinary nic in a host.  
> >
> >Could you enumerate the use scenarios for the SmartNIC?
> >
> >Is SmartNIC always "in-line", i.e. separating the host from the network?  
> 
> As far as I know, it is. The host is always given a PF which is a leg of
> the eswitch managed on the SmartNIC.

Cool, I was hoping that's the case. One less configuration mode :)

> >Do we need a distinction based on whether the SmartNIC controls Host's
> >eswitch vs just the Host in its entirety (i.e. VF switching vs bare
> >metal)?  
> 
> I have this described in the "PFs" section. Basically we need to have a
> toggle to say "the host is managing its own nested eswitch".
> 
> >I find it really useful to be able to list use cases, and constraints
> >first. Then go onto the design.
> >  
> >> Note that the PF is merged with physical port representor.
> >> That is due to simpler and flawless transition from legacy mode and back.
> >> The devlink_ports and netdevs for physical ports are staying during
> >> the transition.  
> >
> >When users put an interface under bridge or a bond they have to move 
> >IP addresses etc. onto the bond. Changing mode to "switchdev" is a more
> >destructive operation and there should be no expectation that
> >configuration survives.  
> 
> Yeah, I was saying the same thing when our arch came up with this, but
> I now think it is just fine. It is the driver's responsibility to do the
> shift. The entities representing the uplink port (netdevs and
> devlink_port instances) can easily stay during the transition. The
> transition only applies to the eswitch and VF entities.

If PF is split from the uplink I think the MAC address should stay with
the PF, not the uplink (which becomes just a repr in a Host case).

> >The merging of the PF with the physical port representor is flawed.  
> 
> Why?

See below.

> >People push Qdisc offloads into devlink because of design shortcuts
> >like this.  
> 
> Could you please explain how it is related to "Qdisc offloads"

Certain users have designs with constrained PCIe bandwidth in the
server, meaning the NIC needs to do buffering much like a switch would.
So we need to separate the uplink from the PF to attach the Qdisc
offload for configuring details of PCIe queuing.

> >>                         +-----------+
> >>                         |phys port 2+-----------------------------------+
> >>                         +-----------+                                   |
> >>                         +-----------+                                   |
> >>                         |phys port 1+---------------------------------+ |
> >>                         +-----------+                                 | |
> >>                                                                       | |
> >> +------------------------------------------------------------------+  | |
> >> |  devlink instance for the whole ASIC                   HOST A    |  | |
> >> |                                                                  |  | |
> >> |  pci/0000:06:00.0  (devlink dev)                                 |  | |
> >> |  +->health reporters, params, info, sb, dpipe,                   |  | |
> >> |  |  resource, traps                                              |  | |
> >> |  |                                                               |  | |
> >> |  +-+port_pci/0000:06:00.0/0+-----------------------+-------------|--+ |
> >> |  | |  flavour physical pfnum 0  (phys port and pf) ^             |    |  
> >
> >Please no.  
> 
> What exactly "no"?

Dual flavorness, and PF being phys port.

> >> |  | |  netdev enp6s0f0np1                           |             |    |
> >> |  | +->health reporters, params                     |             |    |
> >> |  | |                                               |             |    |
> >> |  | +->slice_pci/0000:06:00.0/0+--------------------+             |    |
> >> |  |       flavour physical                                        |    |
> >> |  |                                                               |    |
> >> |  +-+port_pci/0000:06:00.0/1+-----------------------+-------------|----+
> >> |  | |  flavour physical pfnum 1  (phys port and pf) |             |
> >> |  | |  netdev enp6s0f0np2                           |             |
> >> |  | +->health reporters, params                     |             |
> >> |  | |                                               |             |
> >> |  | +->slice_pci/0000:06:00.0/1+--------------------+             |
> >> |  |       flavour physical                                        |
> >> |  |                                                               |
> >> |  +-+-+port_pci/0000:06:00.0/2+-----------+-------------------+   |
> >> |  | | |  flavour pcipf pfnum 2            ^                   |   |
> >> |  | | |  netdev enp6s0f0pf2               |                   |   |
> >> |  | + +->params                           |                   |   |
> >> |  | |                                     |                   |   |
> >> |  | +->slice_pci/0000:06:00.0/2+----------+                   |   |
> >> |  |       flavour pcipf                                       |   |
> >> |  |                                                           |   |
> >> |  +-+-+port_pci/0000:06:00.0/3+-----------+----------------+  |   |
> >> |  | | |  flavour pcivf pfnum 2 vfnum 0    ^                |  |   |
> >> |  | | |  netdev enp6s0pf2vf0              |                |  |   |
> >> |  | | +->params                           |                |  |   |
> >> |  | |                                     |                |  |   |
> >> |  | +-+slice_pci/0000:06:00.0/3+----------+                |  |   |
> >> |  |   |   flavour pcivf                                    |  |   |
> >> |  |   +->rate (qos), mpgroup, mac                          |  |   |
> >> |  |                                                        |  |   |
> >> |  +-+-+port_pci/0000:06:00.0/4+-----------+-------------+  |  |   |
> >> |  | | |  flavour pcivf pfnum 0 vfnum 0    ^             |  |  |   |  
> >
> >So PF 0 is both on the SmartNIC where it is physical and on the Hosts?
> >Is this just error in the diagram?  
> 
> I think it is an error in the reading. This is the VF representation
> of VF pci/0000:06:00.1, on the same host A, which is the SmartNIC host.

Hm, I see pf 0 as the first port that has a line to uplink,
and here pf 0 vf 0 has a line to Host B.

The VF above is "pfnum 2 vfnum 0" which makes sense, PF 2 is 
Host B's PF. So VF 0 of PF 2 is also on Host B.

> >> |  | | |  netdev enp6s0pf0vf0              |             |  |  |   |
> >> |  | | +->params                           |             |  |  |   |
> >> |  | |                                     |             |  |  |   |
> >> |  | +-+slice_pci/0000:06:00.0/4+----------+             |  |  |   |
> >> |  |   |   flavour pcivf                                 |  |  |   |
> >> |  |   +->rate (qos), mpgroup, mac                       |  |  |   |
> >> |  |                                                     |  |  |   |
> >> |  +-+-+port_pci/0000:06:00.0/5+-----------+----------+  |  |  |   |
> >> |  | | |  flavour pcisf pfnum 0 sfnum 1    ^          |  |  |  |   |
> >> |  | | |  netdev enp6s0pf0sf1              |          |  |  |  |   |
> >> |  | | +->params                           |          |  |  |  |   |
> >> |  | |                                     |          |  |  |  |   |
> >> |  | +-+slice_pci/0000:06:00.0/5+----------+          |  |  |  |   |
> >> |  |   |   flavour pcisf                              |  |  |  |   |
> >> |  |   +->rate (qos), mpgroup, mac                    |  |  |  |   |
> >> |  |                                                  |  |  |  |   |
> >> |  +-+slice_pci/0000:06:00.0/6+--------------------+  |  |  |  |   |
> >> |        flavour pcivf pfnum 0 vfnum 1             |  |  |  |  |   |
> >> |            (non-ethernet (IB, NVE)               |  |  |  |  |   |
> >> |                                                  |  |  |  |  |   |
> >> +------------------------------------------------------------------+
> >>                                                    |  |  |  |  |
> >>                                                    |  |  |  |  |
> >>                                                    |  |  |  |  |
> >> +----------------------------------------------+   |  |  |  |  |
> >> |  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
> >> |                                              |   |  |  |  |  |
> >> |  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
> >> |  +->health reporters, info                   |   |  |  |  |  |
> >> |  |                                           |   |  |  |  |  |
> >> |  +-+port_pci/0000:10:00.0/1+---------------------------------+
> >> |    |  flavour virtual                        |   |  |  |  |
> >> |    |  netdev enp16s0f0                       |   |  |  |  |
> >> |    +->health reporters                       |   |  |  |  |
> >> |                                              |   |  |  |  |
> >> +----------------------------------------------+   |  |  |  |
> >>                                                    |  |  |  |
> >> +----------------------------------------------+   |  |  |  |
> >> |  devlink instance VF (other host)    HOST B  |   |  |  |  |
> >> |                                              |   |  |  |  |
> >> |  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
> >> |  +->health reporters, info                   |   |  |  |  |
> >> |  |                                           |   |  |  |  |
> >> |  +-+port_pci/0000:10:00.1/1+------------------------------+
> >> |    |  flavour virtual                        |   |  |  |
> >> |    |  netdev enp16s0f0v0                     |   |  |  |
> >> |    +->health reporters                       |   |  |  |
> >> |                                              |   |  |  |
> >> +----------------------------------------------+   |  |  |
> >>                                                    |  |  |
> >> +----------------------------------------------+   |  |  |
> >> |  devlink instance VF                 HOST A  |   |  |  |
> >> |                                              |   |  |  |
> >> |  pci/0000:06:00.1  (devlink dev)             |   |  |  |
> >> |  +->health reporters, info                   |   |  |  |
> >> |  |                                           |   |  |  |
> >> |  +-+port_pci/0000:06:00.1/1+---------------------------+
> >> |    |  flavour virtual                        |   |  |
> >> |    |  netdev enp6s0f0v0                      |   |  |
> >> |    +->health reporters                       |   |  |
> >> |                                              |   |  |
> >> +----------------------------------------------+   |  |
> >>                                                    |  |
> >> +----------------------------------------------+   |  |
> >> |  devlink instance SF                 HOST A  |   |  |
> >> |                                              |   |  |
> >> |  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
> >> |  +->health reporters, info                   |   |  |
> >> |  |                                           |   |  |
> >> |  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
> >> |    |  flavour virtual                        |   |
> >> |    |  netdev enp6s0f0s1                      |   |
> >> |    +->health reporters                       |   |
> >> |                                              |   |
> >> +----------------------------------------------+   |
> >>                                                    |
> >> +----------------------------------------------+   |
> >> |  devlink instance VF                 HOST A  |   |
> >> |                                              |   |
> >> |  pci/0000:06:00.2  (devlink dev)+----------------+
> >> |  +->health reporters, info                   |
> >> |                                              |
> >> +----------------------------------------------+
> >> 
> >> 
> >> 
> >> 
> >> ==================================================================
> >> ||                                                              ||
> >> ||                 what needs to be implemented                 ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> 1) physical port "pfnum". When PF and physical port representor
> >>    are merged, the instance of devlink_port representing the physical port
> >>    and PF needs to have "pfnum" attribute to be in sync
> >>    with other PF port representors.  
> >
> >See above.
> >  
> >> 2) per-port health reporters are not implemented yet.  
> >
> >Which health reporters are visible on a SmartNIC port?
> 
> I think there is a use case for SmartNIC uplink/pf port health reporters.
> Those are the ones which we have for TX/RX queues on the devlink instance
> now (that was a mistake).
> 
> >
> >The Host ones or the SmartNIC ones?  
> 
> In the host, I think there is a use case for VF/SF devlink_port health
> reporters - also for TX/RX queues.
> 
> >I think queue reporters should be per-queue, see below.  
> 
> Depends. There are reporters, like "fw", that are per-asic.
> 
> 
> >  
> >> 3) devlink_port instance in PF/VF/SF flavour "virtual". In PF/VF/SF devlink
> >>    instance (in VM for example), it would make sense to have a devlink_port
> >>    instance. At least to carry a link to the netdevice name (otherwise we have
> >>    no easy way to find out which devlink instance and netdevice belong to each other).
> >>    I was thinking about flavour name, we have to distinguish from eswitch
> >>    devlink port flavours "pcipf, pcivf, pcisf".  
> >
> >Virtual is the flavor for the VF port, IIUC, so what's left to name?
> >Do you mean pick a phys_port_name format?  
> 
> No. "virtual" is devlink_port flavour in the host in the VF devlink

Yeah, I'm not 100% sure what you're describing as missing.

Perhaps you could rephrase this point?

> >>    This was recently implemented by Parav:
> >> commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
> >> Merge: c04d102ba56e 162add8cbae4
> >> Author: David S. Miller <davem@davemloft.net>
> >> Date:   Tue Mar 3 15:40:40 2020 -0800
> >> 
> >>     Merge branch 'devlink-virtual-port'
> >> 
> >>    What is missing is the "virtual" flavour for nested PF.
> >> 
> >> 4) slice is not implemented yet. This is the original "vdev/subdev" concept.
> >>    See below section "Slice user cmdline API draft".
> >> 
> >> 5) SFs are not implemented.
> >>    See below section "SF (subfunction) user cmdline API draft".
> >> 
> >> 6) rate for slice are not implemented yet.
> >>    See below section "Slice rate user cmdline API draft".
> >> 
> >> 7) mpgroup for slice is not implemented yet.
> >>    See below section "Slice mpgroup user cmdline API draft".
> >> 
> >> 8) VF manual creation using devlink is not implemented yet.
> >>    See below section "VF manual creation and activation user cmdline API draft".
> >>  
> >> 9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
> >>    merged e-switch.
> >> 
> >> 
> >> 
> >> ==================================================================
> >> ||                                                              ||
> >> ||                  Issues/open questions                       ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> 1) "pfnum" has to be per-asic(/devlink instance), not per-host.
> >>    That means that in smartNIC scenario, we cannot have "pfnum 0"
> >>    for smartNIC and "pfnum 0" for host as well.  
> >
> >Right, exactly, NFP already does that.
> >  
> >> 2) Q: for TX, RX queues reporters, should it be bound to devlink_port?
> >>    For which flavours this might make sense?
> >>    Most probably for flavours "physical"/"virtual".
> >>    How about the reporters in VF/SF?  
> >
> >I think with the work Magnus is doing we should have queues as first  
> 
> Can you point me to the "work"?

There was a presentation at LPC last year, and some API proposal
circulated off-list :( Let's CC Magnus.

> >class citizens to be able to allocate them to ports.
> >
> >Would this mean we can hang reporters off queues?  
> 
> Yes, if we have a "queue object", the per-queue reporter would make sense.
> 
> 
> >  
> >> 3) How the management of a nested switch is handled. The PFs created
> >>    dynamically, or the ones in hosts in the smartnic scenario, may each
> >>    be a manager of a nested e-switch. How to toggle this capability?
> >>    During creation by a cmdline option?
> >>    During lifetime in case the PF does not have any children (VFs/SFs)?  
> >
> >Maybe the grouping of functions into devlink instances would help? 
> >SmartNIC could control if the host can perform switching between
> >functions by either putting them in the same Host side devlink 
> >instance or not.  
> 
> I'm not sure I follow. There is a number of PFs created "on probe".
> Those are fixed and driver knows where to put them.
> The comment was about a possible "dynamic PF" created by the user in the
> same way he creates an SF, by devlink cmdline.

How does the driver differentiate between a dynamic and static PF, 
and why are they different in the first place? :S

Also, once the PFs are created user may want to use them together 
or delegate to a VM/namespace. So when I was thinking we'd need some 
sort of a secure handshake between PFs and FW for the host to prove 
to FW that the PFs belong to the same domain of control, and their
resources (and eswitches) can be pooled.

I'm digressing..

> Now the PF itself can have a "nested eswitch" to manage. The "parent
> eswitch" where the PF was created would only see one leg to the "nested
> eswitch".
> 
> This "nested eswitch management" might or might not be required. Depends
> on a usecare. The question was, how to configure that I as a user
> want this or not.

Ack. I'm extending your question. I think the question is not only who
controls the eswitch but also which PFs share the eswitch.

I think eswitch is just one capability, but SmartNIC will want to
control which ports see what capabilities in general: crypto offloads
and such.

I presume in your model if the host controls the eswitch the smartNIC
sees just what comes out of the Host's single "uplink"? What if the
SmartNIC wants the host to be able to control the forwarding but not
lose the ability to tap the VF to VF traffic?

> >> ==================================================================
> >> ||                                                              ||
> >> ||                Slice user cmdline API draft                  ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> Note that some of the "devlink port" attributes may be forgotten or misordered.
> >> 
> >> Slices and ports are created together by the device driver. The driver defines
> >> the relationships during creation.
> >> 
> >> 
> >> $ devlink port show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
> >> 
> >> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
> >> indicate the slice-port pairs.
> >> 
> >> Also, there is a fixed "state" attribute with value "active". This is by
> >> default as the VFs are always created active. In future, it is planned
> >> to implement manual VF creation and activation, similar to what is below
> >> described for SFs.
> >> 
> >> Note that the non-ethernet slice (the last one) does not have any
> >> related port. It can be for example NVE or IB. But since
> >> the "hw_addr" attribute is also omitted, it isn't IB.
> >> 
> >>  
> >> Now set a different MAC address for VF1 on PF0:
> >> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2  
> >
> >What are slices?  
> 
> Slice is basically a piece of the ASIC: pf/vf/sf. They serve for
> configuration of the "other side of the wire", like the MAC. A hypervisor
> admin can use the slice to set the MAC address of a VF which is in the
> virtual machine. Basically this should be a replacement of the "ip vf"
> command.
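To make that concrete, here is a toy Python model (illustrative only, not kernel code) of what a slice carries; the field names mirror the "devlink slice" output in the draft above, and the class itself is an assumption for the sake of the example:

```python
# Toy model of the proposed slice object.  Attribute names follow the
# "devlink slice show" output from the draft; this is not kernel code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slice:
    index: int
    flavour: str                   # "physical" | "pcipf" | "pcivf" | "pcisf"
    pfnum: int
    vfnum: Optional[int] = None
    port: Optional[int] = None     # paired devlink_port index, if any
    hw_addr: Optional[str] = None  # configured by the hypervisor admin
    state: str = "active"

    def set_hw_addr(self, mac: str) -> None:
        # Analogue of: devlink slice set <handle> hw_addr <mac>
        self.hw_addr = mac

# Mirrors the example from the draft:
# devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
vf1 = Slice(index=3, flavour="pcivf", pfnum=0, vfnum=1, port=3,
            hw_addr="10:22:33:44:55:77")
vf1.set_hw_addr("aa:bb:cc:dd:ee:ff")
```

The point of the model is that the slice is the config anchor for "the other side of the wire", while the paired port (the representor) stays the anchor for eswitch-side config.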

I lost my mail archive but didn't we already have a long thread with
Parav about this?

> >> ==================================================================
> >> ||                                                              ||
> >> ||          SF (subfunction) user cmdline API draft             ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> Note that some of the "devlink port" attributes may be forgotten or misordered.
> >> 
> >> Note that some of the "devlink slice" attributes in show commands
> >> are omitted on purpose.
> >> 
> >> $ devlink port show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> >> 
> >> There is one VF on the NIC.
> >> 
> >> 
> >> Now create subfunction SF10 on PF1; the index of the slice is going to be 100
> >> and hw_addr aa:bb:cc:aa:bb:cc.
> >> 
> >> $ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc  
> >
> >Why is the SF number specified by the user rather than allocated?  
> 
> Because it is shown in the representor netdevice name. And you need to
> have it predetermined: enp6s0pf1sf10

I'd think you need to know what was assigned, not necessarily pick
upfront.. I feel like we had this conversation before as well.
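To illustrate the naming argument, here is a small Python sketch of how a representor netdev name could be derived from pfnum/sfnum. The name pattern is taken from the examples in the draft (enp6s0pf1sf10, enp6s0pf0vf0); the helper function itself is hypothetical, not an existing kernel or udev API:

```python
# Hypothetical helper showing why a user-chosen sfnum yields a
# predictable representor name; pattern copied from the draft's examples.
from typing import Optional

def repr_netdev_name(base: str, pfnum: int,
                     sfnum: Optional[int] = None,
                     vfnum: Optional[int] = None) -> str:
    if sfnum is not None:
        return f"{base}pf{pfnum}sf{sfnum}"
    if vfnum is not None:
        return f"{base}pf{pfnum}vf{vfnum}"
    return base

# With sfnum picked upfront, the name is known before the SF exists:
name = repr_netdev_name("enp6s0", pfnum=1, sfnum=10)  # "enp6s0pf1sf10"
```

If the number were kernel-allocated instead, scripts would have to read it back after creation before they could reference the representor, which is the trade-off being debated here.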

> >> The devlink kernel code calls down to the device driver (devlink op) and asks
> >> it to create a slice with particular attributes. The driver then instantiates
> >> the slice and port in the same way it is done for VF:
> >> 
> >> $ devlink port show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> >> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> >> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
> >> 
> >> Note that the SF slice is created but not active. That means the
> >> entities are created on devlink side, the e-switch port representor
> >> is created, but the SF device itself is not yet out there (same host
> >> or different, depends on where the parent PF is - in this case the same host).
> >> User might use e-switch port representor enp6s0pf1sf10 to do settings,
> >> putting it into bridge, adding TC rules, etc.
> >> It's like the cable is unplugged on the other side.  
> >
> >If it's just "cable unplugged" can't we just use the fact the
> >representor is down to indicate no traffic can flow?  
> 
> It is not "cable unplugged". This "state inactive/action" is admin
> state. You as a eswitch admin say, "I'm done with configuring a slice
> (MAC) and a representor (bridge, TC, etc) for this particular SF and
> I want the HOST to instantiate the SF instance (with the configured
> MAC).

I'm not opposed, I just don't understand the need. If ASIC will not 
RX or TX any traffic from/to this new entity until repr is brought up
there should be no problem.

> >> Now we activate (deploy) the SF:
> >> $ devlink slice set pci/0000:06:00.0/100 state active
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> >> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
> >> 
> >> Upon the activation, the device driver asks the device to instantiate
> >> the actual SF device on particular PF. Does not matter if that is
> >> on the same host or not.
> >> 
> >> On the other side, the PF driver instance gets the event from device
> >> that a particular SF was activated. It's the cue to put the device on the
> >> bus, probe it and instantiate netdev and devlink instances for it.  
> >
> >Seems backwards. It's the PF that wants the new function, why can't it
> >just create it and either get an error from the other side or never get
> >link up?  
> 
> We discussed that many times internally. I think it makes sense that
> the SF is created by the same entity that manages the related eswitch
> SF-representor. In other words, the "devlink slice" and related "devlink
> port" object are under the same devlink instance.
> 
> If the PF in the host manages a nested eswitch, it can create the SF
> inside and manage the nested eswitch SF-representor as you describe.
> 
> It is a matter of "nested eswitch manager on/off" configuration.
> 
> I think this is clean model and it is known who has what
> responsibilities.

I see, so you want the creation to be controlled by the same entity that
controls the eswitch..

To me the creation should be on the side that actually needs/will use
the new port. And if it's not the eswitch manager then the eswitch
manager needs to ack it.

The precedent is probably not a strong argument, but that'd be the
same way VFs work.. I don't think you can change how VFs work, right?

> >> ==================================================================
> >> ||                                                              ||
> >> ||   VF manual creation and activation user cmdline API draft   ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> To enter manual mode, the user has to turn off VF dummies creation:
> >> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
> >> $ devlink dev show
> >> pci/0000:06:00.0: vf_dummies disabled
> >> 
> >> It is "enabled" by default in order not to break existing users.
> >> 
> >> By setting the "vf_dummies" attribute to "disabled", the driver
> >> removes all dummy VFs. Only physical ports are present:
> >> 
> >> $ devlink port show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> 
> >> Then the user is able to create them in a similar way as SFs:
> >> 
> >> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
> >> 
> >> The devlink kernel code calls down to the device driver (devlink op) and asks
> >> it to create a slice with particular attributes. The driver then instantiates
> >> the slice and port:
> >> 
> >> $ devlink port show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf0 slice 99
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
> >> 
> >> Now we activate (deploy) the VF:
> >> $ devlink slice set pci/0000:06:00.0/99 state active
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active
> >> 
> >> ==================================================================
> >> ||                                                              ||
> >> ||                             PFs                              ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> There are 2 flavours of PFs:
> >> 1) Parent PF. That is coupled with uplink port. The slice flavour is
> >>    therefore "physical", to be in sync of the flavour of the uplink port.
> >>    In case this Parent PF is actually a leg of upstream embedded switch,
> >>    the slice flavour is "virtual" (same as the port flavour).
> >> 
> >>    $ devlink port show
> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> >> 
> >>    $ devlink slice show
> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> 
> >>    This slice is shown in both "switchdev" and "legacy" modes.
> >> 
> >>    If there is another parent PF, say "0000:06:00.1", that shares the
> >>    same embedded switch, aliasing is established for the devlink handles.
> >> 
> >>    The user can use devlink handles:
> >>    pci/0000:06:00.0
> >>    pci/0000:06:00.1
> >>    as equivalents, pointing to the same devlink instance.
> >> 
> >>    Parent PFs are the ones that may be in control of managing
> >>    embedded switch, on any hierarchy level.
> >> 
> >> 2) Child PF. This is a leg of a PF put to the parent PF. It is
> >>    represented by a slice, and a port (with a netdevice):
> >> 
> >>    $ devlink port show
> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> >>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
> >> 
> >>    $ devlink slice show
> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >>    pci/0000:06:00.0/20: flavour pcipf pfnum 1 port 1 hw_addr aa:bb:cc:aa:bb:87 state active  <<<<<<<<<<
> >> 
> >>    This is a typical smartnic scenario. You would see this list on
> >>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
> >>    one of the hosts. If you send packets to enp6s0f0pf2, they will
> >>    go to the host.
> >> 
> >>    Note that inside the host, the PF is represented again as "Parent PF"
> >>    and may be used to configure nested embedded switch.  
> >
> >This parent/child PF I don't understand. Does it stem from some HW
> >limitations you have?  
> 
> No limitation. It's just a name for 2 roles. I didn't know how else to
> name it for the documentation purposes. Perhaps you can help me.
> 
> The child can simply manage a "nested eswitch". The "parent eswitch"
> would see one leg (pf representor) one way or another. Only in case the
> "nested eswitch" is there, the child would manage it - have separate
> representors for vfs/sfs under its devlink instance.

I see! I wouldn't use the term PF. I think we need a notion of 
a "virtual" port within the NIC to model the eswitch being managed 
by the Host.

If the Host manages the eswitch, the SmartNIC will no longer deal with
its PCIe ports, but only with its virtual uplink.

> >> ==================================================================
> >> ||                                                              ||
> >> ||               Slice operational state extension              ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> In addition to the "state" attribute, which serves the purpose
> >> of setting the "admin state", an "opstate" attribute is added to
> >> reflect the operational state of the slice:
> >> 
> >> 
> >>     opstate                description
> >>     -----------            ------------
> >>     1. attached    State when slice device is bound to the host
> >>                    driver. When the slice device is unbound from the
> >>                    host driver, slice device exits this state and
> >>                    enters detaching state.
> >> 
> >>     2. detaching   State when host is notified to deactivate the
> >>                    slice device and slice device may be undergoing
> >>                    detachment from host driver. When slice device is
> >>                    fully detached from the host driver, slice exits
> >>                    this state and enters detached state.
> >> 
> >>     3. detached    State entered once the driver is fully unbound
> >>                    from the slice device.
> >> 
> >> slice state machine:
> >> --------------------
> >>                                slice state set inactive
> >>                               ----<------------------<---
> >>                               | or  slice delete        |
> >>                               |                         |
> >>   __________              ____|_______              ____|_______
> >>  |          | slice add  |            |slice state |            |
> >>  |          |-------->---|            |------>-----|            |
> >>  | invalid  |            |  inactive  | set active |   active   |
> >>  |          | slice del  |            |            |            |
> >>  |__________|--<---------|____________|            |____________|
> >> 
> >> slice device operational state machine:
> >> ---------------------------------------
> >>   __________                ____________              ___________
> >>  |          | slice state  |            |driver bus  |           |
> >>  | invalid  |-------->-----|  detached  |------>-----| attached  |
> >>  |          | set active   |            | probe()    |           |
> >>  |__________|              |____________|            |___________|
> >>                                  |                        |
> >>                                  ^                    slice set
> >>                                  |                    set inactive
> >>                             successful detach             |
> >>                               or pf reset                 |
> >>                              ____|_______                 |
> >>                             |            | driver bus     |
> >>                  -----------| detaching  |---<-------------
> >>                  |          |            | remove()
> >>                  ^          |____________|
> >>                  |   timeout      |
> >>                  --<---------------
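The two diagrams above can also be read as transition tables. A minimal Python sketch follows; the state and event names are taken from the diagrams, but the dict encoding and the step() helper are purely illustrative, not a proposed API:

```python
# Transition tables for the slice admin-state and operational-state
# machines drawn above. Encoding and helper are illustrative only.

ADMIN_TRANSITIONS = {
    ("invalid", "slice add"): "inactive",
    ("inactive", "slice del"): "invalid",
    ("inactive", "state set active"): "active",
    ("active", "state set inactive"): "inactive",
    ("active", "slice delete"): "inactive",
}

OP_TRANSITIONS = {
    ("invalid", "state set active"): "detached",
    ("detached", "driver bus probe"): "attached",
    ("attached", "driver bus remove"): "detaching",
    ("detaching", "successful detach"): "detached",
    ("detaching", "pf reset"): "detached",
    ("detaching", "timeout"): "detaching",  # self-loop while waiting
}

def step(transitions, state, event):
    """Return the next state, rejecting transitions not in the diagram."""
    try:
        return transitions[(state, event)]
    except KeyError:
        raise ValueError(f"illegal event {event!r} in state {state!r}")
```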
> >> 
> >> 
> >> 
> >> ==================================================================
> >> ||                                                              ||
> >> ||             Slice rate user cmdline API draft                ||
> >> ||                                                              ||
> >> ==================================================================
> >> 
> >> Note that some of the "devlink slice" attributes in show commands
> >> are omitted on purpose.
> >> 
> >> 
> >> $ devlink slice show
> >> pci/0000:06:00.0/0: flavour physical pfnum
> >> pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1
> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 2
> >> pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1
> >> 
> >> The slice object is extended with a new rate object.
> >> 
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> 
> >> This shows the leaf rate objects created by default alongside the slice
> >> objects. No min or max TX rates were set, so those values are omitted.
> >> 
> >> 
> >> Now create a new node rate object:
> >> 
> >> $ devlink slice rate add pci/0000:06:00.0/somerategroup
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node
> >> 
> >> A new node rate object was created - the last line.
> >> 
> >> 
> >> Now create another node rate object, this time with some attributes:
> >> 
> >> $ devlink slice rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> Another new node object was created - the last line. The object has min and max
> >> tx rates set, so they are displayed after the object type.
> >> 
> >> 
> >> Now set the min/max rates of the node named somerategroup using the rate object:
> >> 
> >> $ devlink slice rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> 
> >> Now set leaf slice rate using rate object:
> >> 
> >> $ devlink slice rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> 
> >> Now set the parent node for leaf slice index 2 using the rate object:
> >> 
> >> $ devlink slice rate set pci/0000:06:00.0/2 parent somerategroup
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> 
> >> Now set the parent node for leaf slice index 1 using the rate object:
> >> 
> >> $ devlink slice rate set pci/0000:06:00.0/1 parent somerategroup
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf parent somerategroup
> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> 
> >> Now unset the parent node of leaf slice index 1 using the rate object:
> >> 
> >> $ devlink slice rate set pci/0000:06:00.0/1 noparent
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> 
> >> Now delete node object:
> >> 
> >> $ devlink slice rate del pci/0000:06:00.0/somerategroup
> >> 
> >> $ devlink slice rate
> >> pci/0000:06:00.0/1: type leaf
> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
> >> pci/0000:06:00.0/3: type leaf
> >> pci/0000:06:00.0/4: type leaf
> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
> >> 
> >> Rate node object was removed and its only child pci/0000:06:00.0/2 automatically
> >> detached.  
> >
> >Tomorrow we will support CoDel, ECN or any other queuing construct. 
> >How many APIs do we want to have to configure the same thing? :/  
> 
> Right, I don't see another way. Please help me figure this out
> differently. Note this is for configuring HW limits on the TX side of a VF.
> 
> We have only the devlink_port and related netdev as representor of the VF
> eswitch port. We cannot use this netdev to configure a qdisc on RX.

Ah, right. This is the TX case we abuse act_police for in OvS offload :S
Yeah, we don't have an API for that.
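For what it's worth, the leaf/node semantics from the rate walkthrough above (default leafs, node groups, parent set/unset, node deletion detaching its children) can be modeled in a few lines. This is a toy sketch; the class and method names are invented for illustration and are not the proposed kernel API:

```python
# Toy model of the "devlink slice rate" leaf/node objects shown above.
# Names are illustrative only.

class RateObject:
    def __init__(self, name, kind, min_tx_rate=None, max_tx_rate=None):
        self.name = name
        self.kind = kind                  # "leaf" or "node"
        self.min_tx_rate = min_tx_rate
        self.max_tx_rate = max_tx_rate
        self.parent = None                # name of a node object, or None

class RateTable:
    def __init__(self):
        self.objs = {}

    def add_leaf(self, name):
        # Leaf rate objects are created implicitly alongside slices.
        self.objs[name] = RateObject(name, "leaf")

    def add_node(self, name, **limits):
        # "devlink slice rate add DEV/NAME [min_tx_rate N] [max_tx_rate N]"
        self.objs[name] = RateObject(name, "node", **limits)

    def set_parent(self, name, parent):
        # "devlink slice rate set DEV/NAME parent NODE" / "... noparent"
        if parent is not None and self.objs[parent].kind != "node":
            raise ValueError("parent must be a node rate object")
        self.objs[name].parent = parent

    def del_node(self, name):
        # Deleting a node automatically detaches its children.
        for obj in self.objs.values():
            if obj.parent == name:
                obj.parent = None
        del self.objs[name]
```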

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-20 21:25     ` Jakub Kicinski
@ 2020-03-21  9:07       ` Parav Pandit
  2020-03-23 19:31         ` Jakub Kicinski
  2020-03-21  9:35       ` Jiri Pirko
  2020-03-23 21:32       ` Andy Gospodarek
  2 siblings, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-03-21  9:07 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, davem, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On 3/21/2020 2:55 AM, Jakub Kicinski wrote:
> On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
>> Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:
>>> On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:  

>> I'm not sure I follow. There is a number of PFs created "on probe".
>> Those are fixed and the driver knows where to put them.
>> The comment was about a possible "dynamic PF" created by the user in
>> the same way he creates an SF, via the devlink cmdline.
> 
> How does the driver differentiate between a dynamic and static PF, 
> and why are they different in the first place? :S
> 

I don't see a need for the PF driver to differentiate them as
static/dynamic. In either case pci probe()/remove() are triggered
regardless.

They are different because a smartnic administrator wants to create them
dynamically on user request and provide them to the system where the
smartnic is connected.

> Also, once the PFs are created user may want to use them together 

>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>>>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
>>>>
>>>> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
>>>> indicate the slice-port pairs.
>>>>
>>>> Also, there is a fixed "state" attribute with value "active". This is by
>>>> default as the VFs are always created active. In future, it is planned
>>>> to implement manual VF creation and activation, similar to what is below
>>>> described for SFs.
>>>>
>>>> Note that the non-ethernet slice (the last one) does not have any
>>>> related port. It can be for example NVE or IB. But since
>>>> the "hw_addr" attribute is also omitted, it isn't IB.
>>>>
>>>>  
>>>> Now set a different MAC address for VF1 on PF0:
>>>> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>>>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2  
>>>
>>> What are slices?  
>>
>> A slice is basically a piece of the ASIC: a PF/VF/SF. Slices serve for
>> configuration of the "other side of the wire", like the MAC. A hypervisor
>> admin can use the slice to set the MAC address of a VF which is in a
>> virtual machine. Basically this should be a replacement for the "ip vf"
>> command.
> 
> I lost my mail archive but didn't we already have a long thread with
> Parav about this?
>

I have a more verbose RFC that talks about the motivation for slices,
with similar examples, which I will post along with the actual patches.

But I think Jiri captured the gist already here, i.e. the slice drives
the device life cycle (with/without eswitch).

The slice device activation model is more than just a port unplug/plug
event, so we shouldn't overload the devlink port, which doesn't undergo
the state transitions explained for the slice.

>>>> ==================================================================
>>>> ||                                                              ||
>>>> ||          SF (subfunction) user cmdline API draft             ||
>>>> ||                                                              ||
>>>> ==================================================================
>>>>
>>>> Note that some of the "devlink port" attributes may be forgotten or misordered.
>>>>
>>>> Note that some of the "devlink slice" attributes in show commands
>>>> are omitted on purpose.
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>>>>
>>>> There is one VF on the NIC.
>>>>
>>>>
>>>> Now create subfunction SF10 on PF1; the index of the slice is going to be 100
>>>> and the hw_addr aa:bb:cc:aa:bb:cc.
>>>>
>>>> $ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc  
>>>
>>> Why is the SF number specified by the user rather than allocated?  
>>

>> Because it is shown in the representor netdevice name. And you need to
>> have it predetermined: enp6s0pf1sf10
> 
> I'd think you need to know what was assigned, not necessarily pick
> upfront.. I feel like we had this conversation before as well.
> 
Assignment by the user ensures that the netdev and/or rdma device of a
slice (subfunction), its representor, its health reporters and the
devlink instance of the slice all get a predictable identity.
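A tiny sketch of that "predictable identity" point: with a user-chosen sfnum, the representor netdev name is known before the slice is even created. The name pattern below follows the enp6s0pf1sf10 example used in this thread; the helper itself is hypothetical, not udev/systemd policy:

```python
def sf_repr_netdev(uplink_base, pfnum, sfnum):
    """Representor netdev name for an SF, following the enp6s0pf1sf10
    pattern from the examples above (illustrative, not naming policy)."""
    return f"{uplink_base}pf{pfnum}sf{sfnum}"
```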

>>>> The devlink kernel code calls down to the device driver (devlink op) and asks
>>>> it to create a slice with particular attributes. The driver then instantiates
>>>> the slice and port in the same way it is done for a VF:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>>>> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>>>> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
>>>>
>>>> Note that the SF slice is created but not active. That means the
>>>> entities are created on devlink side, the e-switch port representor
>>>> is created, but the SF device itself is not yet out there (same host
>>>> or different, depends on where the parent PF is - in this case the same host).
>>>> User might use e-switch port representor enp6s0pf1sf10 to do settings,
>>>> putting it into bridge, adding TC rules, etc.
>>>> It's like the cable is unplugged on the other side.  
>>>
>>> If it's just "cable unplugged" can't we just use the fact the
>>> representor is down to indicate no traffic can flow?  
>>
>> It is not "cable unplugged". This "state inactive/active" is admin
>> state. You as an eswitch admin say, "I'm done with configuring a slice
>> (MAC) and a representor (bridge, TC, etc) for this particular SF and
>> I want the HOST to instantiate the SF instance (with the configured
>> MAC)."
> 
> I'm not opposed, I just don't understand the need. If the ASIC will not
> RX or TX any traffic from/to this new entity until the repr is brought
> up, there should be no problem.
> 
You may still want the device to be active and unplug the cable or plug
the cable back later on.

Additionally, once a slice is deactivated, the portion of the PCI BAR
space used by this slice should be deactivated too, so that the device/fw
can drop accesses by untrusted software that may happen even after the
slice is deactivated.
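A toy model of that gating, assuming (as described above) that the device/fw drops BAR accesses for an inactive slice; all names here are invented for illustration:

```python
# Toy model: firmware drops BAR accesses for an inactive slice, so even
# untrusted software touching the BAR after deactivation gets rejected.

class SliceBar:
    def __init__(self):
        self.active = False

    def activate(self):
        self.active = True

    def deactivate(self):
        # Revoke this slice's window of the PCI BAR space.
        self.active = False

    def mmio_read(self, offset):
        if not self.active:
            raise PermissionError("access dropped: slice inactive")
        return 0  # placeholder register value
```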

>>>> Now we activate (deploy) the SF:
>>>> $ devlink slice set pci/0000:06:00.0/100 state active
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>>>> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
>>>>
>>>> Upon the activation, the device driver asks the device to instantiate
>>>> the actual SF device on a particular PF. It does not matter if that is
>>>> on the same host or not.
>>>>
>>>> On the other side, the PF driver instance gets the event from the device
>>>> that a particular SF was activated. It's the cue to put the device on the
>>>> bus, probe it and instantiate netdev and devlink instances for it.  
>>>
>>> Seems backwards. It's the PF that wants the new function, why can't it
>>> just create it and either get an error from the other side or never get
>>> link up?  
>>
>> We discussed that many times internally. I think it makes sense that
>> the SF is created by the same entity that manages the related eswitch
>> SF-representor. In other words, the "devlink slice" and related "devlink
>> port" object are under the same devlink instance.
>>
>> If the PF in the host manages a nested eswitch, it can create the SF
>> inside and manage the nested eswitch SF-representor as you describe.
>>
>> It is a matter of "nested eswitch manager on/off" configuration.
>>
>> I think this is a clean model and it is known who has what
>> responsibilities.
> 
> I see so you want the creation to be controlled by the same entity that
> controls the eswitch..
> 
> To me the creation should be on the side that actually needs/will use
> the new port. And if it's not the eswitch manager then the eswitch
> manager needs to ack it.
>

There are a few reasons to create them on the eswitch manager system,
listed below.

1. Creation and deletion on one system, and synchronizing them with the
eswitch system, requires multiple back-and-forth calls between the two
systems.

2. When this happens, the system where the device is created doesn't
know when is the right time to provision it to a VM or to an application.
udev/systemd/NetworkManager and other such software might already start
initializing it, doing DHCP, while its switch side is not yet ready.

So it is desirable to make sure the device is activated only once it is
fully ready/configured.

3. Additionally, deletion doesn't follow the mirror sequence of creation
when the device is created on the host.

4. The eswitch administrator simply doesn't have direct access to the
system where this device is used, so the device just cannot be created
there by the eswitch administrator.
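Reasons (1) and (2) amount to an ordering constraint: configure on the eswitch side first, activate last, so the host never probes a half-configured device. A hypothetical sketch of that sequence (not real driver code; all names invented):

```python
# Sketch of the eswitch-side lifecycle: slice is created inactive,
# its representor is configured, and only then is it activated.

class Slice:
    def __init__(self):
        self.state = "inactive"       # devlink slice add ... (created inactive)
        self.repr_configured = False

    def configure_repr(self):
        # Bridge membership, TC rules, MAC, etc. on the representor.
        self.repr_configured = True

    def activate(self):
        # devlink slice set ... state active
        assert self.repr_configured, "configure the representor first"
        self.state = "active"

def deploy_sf():
    s = Slice()
    s.configure_repr()
    s.activate()
    return s
```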

> The precedent is probably not a strong argument, but that'd be the
> same way VFs work.. I don't think you can change how VFs work, right?
>

At least not with the currently available mlx5 devices. But this model
enables future VF devices to operate in the same way as SFs.

>>>> ==================================================================
>>>> ||                                                              ||
>>>> ||   VF manual creation and activation user cmdline API draft   ||
>>>> ||                                                              ||
>>>> ==================================================================
>>>>
>>>> To enter manual mode, the user has to turn off VF dummies creation:
>>>> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>>>> $ devlink dev show
>>>> pci/0000:06:00.0: vf_dummies disabled
>>>>
>>>> It is "enabled" by default in order not to break existing users.
>>>>
>>>> By setting the "vf_dummies" attribute to "disabled", the driver
>>>> removes all dummy VFs. Only physical ports are present:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>>
>>>> Then the user is able to create them in a similar way as SFs:
>>>>
>>>> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
>>>>
>>>> The devlink kernel code calls down to the device driver (devlink op) and asks
>>>> it to create a slice with particular attributes. The driver then instantiates
>>>> the slice and port:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf0 slice 99
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
>>>>
>>>> Now we activate (deploy) the VF:
>>>> $ devlink slice set pci/0000:06:00.0/99 state active
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-20 21:25     ` Jakub Kicinski
  2020-03-21  9:07       ` Parav Pandit
@ 2020-03-21  9:35       ` Jiri Pirko
  2020-03-23 19:21         ` Jakub Kicinski
  2020-03-23 23:32         ` Andy Gospodarek
  2020-03-23 21:32       ` Andy Gospodarek
  2 siblings, 2 replies; 50+ messages in thread
From: Jiri Pirko @ 2020-03-21  9:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

Fri, Mar 20, 2020 at 10:25:08PM CET, kuba@kernel.org wrote:
>On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
>> Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:
>> >On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:  
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||            Overall illustration of example setup             ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
>> >> Host B may be one of the hosts which gets a PF. Also, you might omit
>> >> Host B and just see Host A as an ordinary NIC in a host.  
>> >
>> >Could you enumerate the use scenarios for the SmartNIC?
>> >
>> >Is SmartNIC always "in-line", i.e. separating the host from the network?  
>> 
>> As far as I know, it is. The host is always given a PF which is a leg of
>> the eswitch managed on the SmartNIC.
>
>Cool, I was hoping that's the case. One less configuration mode :)
>
>> >Do we need a distinction based on whether the SmartNIC controls Host's
>> >eswitch vs just the Host in its entirety (i.e. VF switching vs bare
>> >metal)?  
>> 
>> I have this described in the "PFs" section. Basically we need to have a
>> toggle to say "host is managing its own nested eswitch".
>> 
>> >I find it really useful to be able to list use cases, and constraints
>> >first. Then go onto the design.
>> >  
>> >> Note that the PF is merged with the physical port representor.
>> >> That allows a simpler and flawless transition from legacy mode and back.
>> >> The devlink_ports and netdevs for physical ports are staying during
>> >> the transition.  
>> >
>> >When users put an interface under bridge or a bond they have to move 
>> >IP addresses etc. onto the bond. Changing mode to "switchdev" is a more
>> >destructive operation and there should be no expectation that
>> >configuration survives.  
>> 
>> Yeah, I was saying the same thing when our arch came up with this, but
>> I now think it is just fine. It is the driver's responsibility to do the
>> shift. And the entities representing the uplink port - netdevs and
>> devlink_port instances - can easily stay during the transition. The
>> transition only applies to the eswitch and VF entities.
>
>If PF is split from the uplink I think the MAC address should stay with
>the PF, not the uplink (which becomes just a repr in a Host case).
>
>> >The merging of the PF with the physical port representor is flawed.  
>> 
>> Why?
>
>See below.
>
>> >People push Qdisc offloads into devlink because of design shortcuts
>> >like this.  
>> 
>> Could you please explain how it is related to "Qdisc offloads"
>
>Certain users have designs with constrained PCIe bandwidth in the
>server, meaning the NIC needs to do buffering much like a switch would.
>So we need to separate the uplink from the PF to attach the Qdisc
>offload for configuring the details of PCIe queuing.

Hmm, I'm not sure I understand. If the PF and uplink are the same entity,
you can still attach the qdisc to this entity, right? What prevents the
same functionality as with the "split"?


>
>> >>                         +-----------+
>> >>                         |phys port 2+-----------------------------------+
>> >>                         +-----------+                                   |
>> >>                         +-----------+                                   |
>> >>                         |phys port 1+---------------------------------+ |
>> >>                         +-----------+                                 | |
>> >>                                                                       | |
>> >> +------------------------------------------------------------------+  | |
>> >> |  devlink instance for the whole ASIC                   HOST A    |  | |
>> >> |                                                                  |  | |
>> >> |  pci/0000:06:00.0  (devlink dev)                                 |  | |
>> >> |  +->health reporters, params, info, sb, dpipe,                   |  | |
>> >> |  |  resource, traps                                              |  | |
>> >> |  |                                                               |  | |
>> >> |  +-+port_pci/0000:06:00.0/0+-----------------------+-------------|--+ |
>> >> |  | |  flavour physical pfnum 0  (phys port and pf) ^             |    |  
>> >
>> >Please no.  
>> 
>> What exactly "no"?
>
>Dual flavorness, and PF being phys port.

I see.


>
>> >> |  | |  netdev enp6s0f0np1                           |             |    |
>> >> |  | +->health reporters, params                     |             |    |
>> >> |  | |                                               |             |    |
>> >> |  | +->slice_pci/0000:06:00.0/0+--------------------+             |    |
>> >> |  |       flavour physical                                        |    |
>> >> |  |                                                               |    |
>> >> |  +-+port_pci/0000:06:00.0/1+-----------------------+-------------|----+
>> >> |  | |  flavour physical pfnum 1  (phys port and pf) |             |
>> >> |  | |  netdev enp6s0f0np2                           |             |
>> >> |  | +->health reporters, params                     |             |
>> >> |  | |                                               |             |
>> >> |  | +->slice_pci/0000:06:00.0/1+--------------------+             |
>> >> |  |       flavour physical                                        |
>> >> |  |                                                               |
>> >> |  +-+-+port_pci/0000:06:00.0/2+-----------+-------------------+   |
>> >> |  | | |  flavour pcipf pfnum 2            ^                   |   |
>> >> |  | | |  netdev enp6s0f0pf2               |                   |   |
>> >> |  | + +->params                           |                   |   |
>> >> |  | |                                     |                   |   |
>> >> |  | +->slice_pci/0000:06:00.0/2+----------+                   |   |
>> >> |  |       flavour pcipf                                       |   |
>> >> |  |                                                           |   |
>> >> |  +-+-+port_pci/0000:06:00.0/3+-----------+----------------+  |   |
>> >> |  | | |  flavour pcivf pfnum 2 vfnum 0    ^                |  |   |
>> >> |  | | |  netdev enp6s0pf2vf0              |                |  |   |
>> >> |  | | +->params                           |                |  |   |
>> >> |  | |                                     |                |  |   |
>> >> |  | +-+slice_pci/0000:06:00.0/3+----------+                |  |   |
>> >> |  |   |   flavour pcivf                                    |  |   |
>> >> |  |   +->rate (qos), mpgroup, mac                          |  |   |
>> >> |  |                                                        |  |   |
>> >> |  +-+-+port_pci/0000:06:00.0/4+-----------+-------------+  |  |   |
>> >> |  | | |  flavour pcivf pfnum 0 vfnum 0    ^             |  |  |   |  
>> >
>> >So PF 0 is both on the SmartNIC where it is physical and on the Hosts?
>> >Is this just error in the diagram?  
>> 
>> I think it is an error in the reading. This is the VF representation
>> of VF pci/0000:06:00.1, on the same host A, which is the SmartNIC host.
>
>Hm, I see pf 0 as the first port that has a line to uplink,
>and here pf 0 vf 0 has a line to Host B.

No. The line is to "Host A"

>
>The VF above is "pfnum 2 vfnum 0" which makes sense, PF 2 is 
>Host B's PF. So VF 0 of PF 2 is also on Host B.
>
>> >> |  | | |  netdev enp6s0pf0vf0              |             |  |  |   |
>> >> |  | | +->params                           |             |  |  |   |
>> >> |  | |                                     |             |  |  |   |
>> >> |  | +-+slice_pci/0000:06:00.0/4+----------+             |  |  |   |
>> >> |  |   |   flavour pcivf                                 |  |  |   |
>> >> |  |   +->rate (qos), mpgroup, mac                       |  |  |   |
>> >> |  |                                                     |  |  |   |
>> >> |  +-+-+port_pci/0000:06:00.0/5+-----------+----------+  |  |  |   |
>> >> |  | | |  flavour pcisf pfnum 0 sfnum 1    ^          |  |  |  |   |
>> >> |  | | |  netdev enp6s0pf0sf1              |          |  |  |  |   |
>> >> |  | | +->params                           |          |  |  |  |   |
>> >> |  | |                                     |          |  |  |  |   |
>> >> |  | +-+slice_pci/0000:06:00.0/5+----------+          |  |  |  |   |
>> >> |  |   |   flavour pcisf                              |  |  |  |   |
>> >> |  |   +->rate (qos), mpgroup, mac                    |  |  |  |   |
>> >> |  |                                                  |  |  |  |   |
>> >> |  +-+slice_pci/0000:06:00.0/6+--------------------+  |  |  |  |   |
>> >> |        flavour pcivf pfnum 0 vfnum 1             |  |  |  |  |   |
>> >> |            (non-ethernet (IB, NVE)               |  |  |  |  |   |
>> >> |                                                  |  |  |  |  |   |
>> >> +------------------------------------------------------------------+
>> >>                                                    |  |  |  |  |
>> >>                                                    |  |  |  |  |
>> >>                                                    |  |  |  |  |
>> >> +----------------------------------------------+   |  |  |  |  |
>> >> |  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
>> >> |                                              |   |  |  |  |  |
>> >> |  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
>> >> |  +->health reporters, info                   |   |  |  |  |  |
>> >> |  |                                           |   |  |  |  |  |
>> >> |  +-+port_pci/0000:10:00.0/1+---------------------------------+
>> >> |    |  flavour virtual                        |   |  |  |  |
>> >> |    |  netdev enp16s0f0                       |   |  |  |  |
>> >> |    +->health reporters                       |   |  |  |  |
>> >> |                                              |   |  |  |  |
>> >> +----------------------------------------------+   |  |  |  |
>> >>                                                    |  |  |  |
>> >> +----------------------------------------------+   |  |  |  |
>> >> |  devlink instance VF (other host)    HOST B  |   |  |  |  |
>> >> |                                              |   |  |  |  |
>> >> |  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
>> >> |  +->health reporters, info                   |   |  |  |  |
>> >> |  |                                           |   |  |  |  |
>> >> |  +-+port_pci/0000:10:00.1/1+------------------------------+
>> >> |    |  flavour virtual                        |   |  |  |
>> >> |    |  netdev enp16s0f0v0                     |   |  |  |
>> >> |    +->health reporters                       |   |  |  |
>> >> |                                              |   |  |  |
>> >> +----------------------------------------------+   |  |  |
>> >>                                                    |  |  |
>> >> +----------------------------------------------+   |  |  |
>> >> |  devlink instance VF                 HOST A  |   |  |  |
>> >> |                                              |   |  |  |
>> >> |  pci/0000:06:00.1  (devlink dev)             |   |  |  |
>> >> |  +->health reporters, info                   |   |  |  |
>> >> |  |                                           |   |  |  |
>> >> |  +-+port_pci/0000:06:00.1/1+---------------------------+
>> >> |    |  flavour virtual                        |   |  |
>> >> |    |  netdev enp6s0f0v0                      |   |  |
>> >> |    +->health reporters                       |   |  |
>> >> |                                              |   |  |
>> >> +----------------------------------------------+   |  |
>> >>                                                    |  |
>> >> +----------------------------------------------+   |  |
>> >> |  devlink instance SF                 HOST A  |   |  |
>> >> |                                              |   |  |
>> >> |  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
>> >> |  +->health reporters, info                   |   |  |
>> >> |  |                                           |   |  |
>> >> |  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
>> >> |    |  flavour virtual                        |   |
>> >> |    |  netdev enp6s0f0s1                      |   |
>> >> |    +->health reporters                       |   |
>> >> |                                              |   |
>> >> +----------------------------------------------+   |
>> >>                                                    |
>> >> +----------------------------------------------+   |
>> >> |  devlink instance VF                 HOST A  |   |
>> >> |                                              |   |
>> >> |  pci/0000:06:00.2  (devlink dev)+----------------+
>> >> |  +->health reporters, info                   |
>> >> |                                              |
>> >> +----------------------------------------------+
>> >> 
>> >> 
>> >> 
>> >> 
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||                 what needs to be implemented                 ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> 1) physical port "pfnum". When PF and physical port representor
>> >>    are merged, the instance of devlink_port representing the physical port
>> >>    and PF needs to have "pfnum" attribute to be in sync
>> >>    with other PF port representors.  
>> >
>> >See above.
>> >  
>> >> 2) per-port health reporters are not implemented yet.  
>> >
>> >Which health reports are visible on a SmartNIC port?   
>> 
>> I think there is a use case for SmartNIC uplink/pf port health reporters.
>> Those are the ones which we now have for TX/RX queues on the devlink
>> instance (that was a mistake).
>> 
>> >
>> >The Host ones or the SmartNIC ones?  
>> 
>> In the host, I think there is a use case for VF/SF devlink_port health
>> reporters - also for TX/RX queues.
>> 
>> >I think queue reporters should be per-queue, see below.  
>> 
>> Depends. There are reporters, like "fw", that are per-ASIC.
>> 
>> 
>> >  
>> >> 3) devlink_port instance in PF/VF/SF flavour "virtual". In a PF/VF/SF devlink
>> >>    instance (in a VM for example), it would make sense to have a devlink_port
>> >>    instance. At least to carry a link to the netdevice name (otherwise we have
>> >>    no easy way to find out that a devlink instance and a netdevice belong to
>> >>    each other). I was thinking about the flavour name; we have to distinguish
>> >>    it from the eswitch devlink port flavours "pcipf, pcivf, pcisf".
>> >
>> >Virtual is the flavor for the VF port, IIUC, so what's left to name?
>> >Do you mean pick a phys_port_name format?  
>> 
>> No. "virtual" is the devlink_port flavour in the host, in the VF devlink instance.
>
>Yeah, I'm not 100% sure what you're describing as missing.
>
>Perhaps you could rephrase this point?

Look at the example above. See:
pci/0000:10:00.0
That is a devlink instance in host B. It has a port:
pci/0000:10:00.0/1
with flavour "virtual"
that tells the user that this is a virtual link, to the parent eswitch.


>
>> >>    This was recently implemented by Parav:
>> >> commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
>> >> Merge: c04d102ba56e 162add8cbae4
>> >> Author: David S. Miller <davem@davemloft.net>
>> >> Date:   Tue Mar 3 15:40:40 2020 -0800
>> >> 
>> >>     Merge branch 'devlink-virtual-port'
>> >> 
>> >>    What is missing is the "virtual" flavour for nested PF.
>> >> 
>> >> 4) slice is not implemented yet. This is the original "vdev/subdev" concept.
>> >>    See below section "Slice user cmdline API draft".
>> >> 
>> >> 5) SFs are not implemented.
>> >>    See below section "SF (subfunction) user cmdline API draft".
>> >> 
>> >> 6) rate for slice are not implemented yet.
>> >>    See below section "Slice rate user cmdline API draft".
>> >> 
>> >> 7) mpgroup for slice is not implemented yet.
>> >>    See below section "Slice mpgroup user cmdline API draft".
>> >> 
>> >> 8) VF manual creation using devlink is not implemented yet.
>> >>    See below section "VF manual creation and activation user cmdline API draft".
>> >>  
>> >> 9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
>> >>    merged e-switch.
>> >> 
>> >> 
>> >> 
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||                  Issues/open questions                       ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> 1) "pfnum" has to be per-asic(/devlink instance), not per-host.
>> >>    That means that in smartNIC scenario, we cannot have "pfnum 0"
>> >>    for smartNIC and "pfnum 0" for host as well.  
>> >
>> >Right, exactly, NFP already does that.
>> >  
>> >> 2) Q: for TX, RX queues reporters, should it be bound to devlink_port?
>> >>    For which flavours this might make sense?
>> >>    Most probably for flavours "physical"/"virtual".
>> >>    How about the reporters in VF/SF?  
>> >
>> >I think with the work Magnus is doing we should have queues as first  
>> 
>> Can you point me to the "work"?
>
>There was a presentation at LPC last year, and some API proposal
>circulated off-list :( Let's CC Magnus.

Ok.


>
>> >class citizens to be able to allocate them to ports.
>> >
>> >Would this mean we can hang reporters off queues?  
>> 
>> Yes, if we have a "queue object", a per-queue reporter would make sense.
>> 
>> 
>> >  
>> >> 3) How the management of nested switch is handled. The PFs created dynamically
>> >>    or the ones in hosts in smartnic scenario may themselves be each a manager
>> >>    of nested e-switch. How to toggle this capability?
>> >>    During creation by a cmdline option?
>> >>    During lifetime, in case the PF does not have any children (VFs/SFs)?  
>> >
>> >Maybe the grouping of functions into devlink instances would help? 
>> >SmartNIC could control if the host can perform switching between
>> >functions by either putting them in the same Host side devlink 
>> >instance or not.  
>> 
>> I'm not sure I follow. There is a number of PFs created "on probe".
>> Those are fixed and driver knows where to put them.
>> The comment was about a possible "dynamic PF" created by the user in the
>> same way as SFs are created, via the devlink cmdline.
>
>How does the driver differentiate between a dynamic and static PF, 
>and why are they different in the first place? :S

They are not different. The driver just has to know to instantiate the
static ones on probe, so in the SmartNIC case each host has a PF.


>
>Also, once the PFs are created user may want to use them together 
>or delegate to a VM/namespace. So when I was thinking we'd need some 
>sort of a secure handshake between PFs and FW for the host to prove 
>to FW that the PFs belong to the same domain of control, and their
>resources (and eswitches) can be pooled.
>
>I'm digressing..

Yeah. This needs to be sorted out.


>
>> Now the PF itself can have a "nested eswitch" to manage. The "parent
>> eswitch" where the PF was created would only see one leg to the "nested
>> eswitch".
>> 
>> This "nested eswitch management" might or might not be required. Depends
>> on the use case. The question was how to configure whether I as a user
>> want this or not.
>
>Ack. I'm extending your question. I think the question is not only who
>controls the eswitch but also which PFs share the eswitch.

Yes.

>
>I think eswitch is just one capability, but SmartNIC will want to
>control which ports see what capabilities in general. crypto offloads
>and such.
>
>I presume in your model if host controls eswitch the smartNIC sees just

The host may control the "nested eswitch" in the SmartNIC case.


>what comes out of the Host's single "uplink"? What if SmartNIC wants
>the host to be able to control the forwarding but not loose the ability
>to tap the VF to VF traffic?

You mean that the VF representors would be in both the SmartNIC host and
the host? I don't know how that could work. I think it has to be in one
place or the other.



>
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||                Slice user cmdline API draft                  ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> Note that some of the "devlink port" attributes may be forgotten or misordered.
>> >> 
>> >> Slices and ports are created together by device driver. The driver defines
>> >> the relationships during creation.
>> >> 
>> >> 
>> >> $ devlink port show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
>> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
>> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
>> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
>> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
>> >> 
>> >> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
>> >> indicate the slice-port pairs.
>> >> 
>> >> Also, there is a fixed "state" attribute with value "active". This is by
>> >> default as the VFs are always created active. In future, it is planned
>> >> to implement manual VF creation and activation, similar to what is below
>> >> described for SFs.
>> >> 
>> >> Note that the non-ethernet slice (the last one) does not have any
>> >> related port. It can be for example NVE or IB. But since
>> >> the "hw_addr" attribute is also omitted, it isn't IB.
>> >> 
>> >>  
>> >> Now set a different MAC address for VF1 on PF0:
>> >> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2  
>> >
>> >What are slices?  
>> 
>> A slice is basically a piece of the ASIC: PF/VF/SF. Slices serve for
>> configuration of the "other side of the wire", like the MAC. A hypervisor
>> admin can use the slice to set the MAC address of a VF which is in the
>> virtual machine. Basically this should be a replacement of the "ip vf"
>> command.
>
>I lost my mail archive but didn't we already have a long thread with
>Parav about this?

I believe so.


>
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||          SF (subfunction) user cmdline API draft             ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> Note that some of the "devlink port" attributes may be forgotten or misordered.
>> >> 
>> >> Note that some of the "devlink slice" attributes in show commands
>> >> are omitted on purpose.
>> >> 
>> >> $ devlink port show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>> >> 
>> >> There is one VF on the NIC.
>> >> 
>> >> 
>> >> Now create a subfunction on PF1. The index of the slice is going to be 100
>> >> and the hw_addr aa:bb:cc:aa:bb:cc.
>> >> 
>> >> $ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc  
>> >
>> >Why is the SF number specified by the user rather than allocated?  
>> 
>> Because it is shown in the representor netdevice name, and you need to
>> have it predetermined: enp6s0pf1sf10
>
>I'd think you need to know what was assigned, not necessarily pick
>upfront.. I feel like we had this conversation before as well.

Yeah. For scripting's sake, when you create something, you can
directly use it later in the script. Like when you create a bridge, you
assign it a name so you can use it.

The "what was assigned" approach would mean that the assigned
value has to be somehow returned from the kernel and passed to the
script. Not sure how. Do you have any example of where this happens in
networking?

I think it is very clear to pass a handle to the command so you can
later on use this handle to manipulate the entity.

+ I don't see the benefit of the auto-allocation. Do you?
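To illustrate the scripting point: because the user picks sfnum, the
representor name is fully predictable and a script can compute it up front.
A minimal userspace sketch, assuming the "pf<N>sf<N>" naming convention
seen in the examples in this thread; sf_phys_port_name is a hypothetical
helper, not an existing API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the "pf<N>sf<N>" port-name convention,
 * so a script that chose pfnum/sfnum can predict the representor name
 * suffix (e.g. the "pf1sf10" in enp6s0pf1sf10) without having to read
 * an auto-allocated value back from the kernel. */
static void sf_phys_port_name(char *buf, size_t len,
                              unsigned int pfnum, unsigned int sfnum)
{
	snprintf(buf, len, "pf%usf%u", pfnum, sfnum);
}
```

With auto-allocation, the script would instead have to parse the assigned
number out of the kernel's reply before it could touch the new port.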


>
>> >> The devlink kernel code calls down to device driver (devlink op) and asks
>> >> it to create a slice with particular attributes. Driver then instantiates
>> >> the slice and port in the same way it is done for VF:
>> >> 
>> >> $ devlink port show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>> >> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>> >> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
>> >> 
>> >> Note that the SF slice is created but not active. That means the
>> >> entities are created on devlink side, the e-switch port representor
>> >> is created, but the SF device itself is not yet out there (same host
>> >> or different, depends on where the parent PF is - in this case the same host).
>> >> User might use e-switch port representor enp6s0pf1sf10 to do settings,
>> >> putting it into bridge, adding TC rules, etc.
>> >> It's like the cable is unplugged on the other side.  
>> >
>> >If it's just "cable unplugged" can't we just use the fact the
>> >representor is down to indicate no traffic can flow?  
>> 
>> It is not "cable unplugged". This "state inactive/active" is the admin
>> state. You as an eswitch admin say: "I'm done with configuring a slice
>> (MAC) and a representor (bridge, TC, etc) for this particular SF and
>> I want the HOST to instantiate the SF instance (with the configured
>> MAC)."
>
>I'm not opposed, I just don't understand the need. If ASIC will not 
>RX or TX any traffic from/to this new entity until repr is brought up
>there should be no problem.

The only need I see is that the eswitch manager can configure things for
the slice and representor until the "activation kick" makes the device
appear on the host. What you suggest is to do the "activation kick" by
representor netdevice admin_up. There is a connection from slice to
netdevice, but it is indirect:
devlink_slice->devlink_port->netdev.
Also, some slices may not have netdev (IB/NVE/etc).
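The indirection mentioned above can be sketched like this (the type names
are illustrative stand-ins, not the actual kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of devlink_slice -> devlink_port -> netdev: both
 * links are optional, so a non-ethernet slice (IB/NVE) has no port and
 * therefore no netdev whose admin-up could serve as the "activation
 * kick" - which is why a slice-level state attribute is proposed. */
struct netdev { const char *name; };
struct devlink_port { struct netdev *netdev; };
struct devlink_slice { struct devlink_port *port; };

static struct netdev *slice_netdev(const struct devlink_slice *slice)
{
	return slice->port ? slice->port->netdev : NULL;
}
```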


>
>> >> Now we activate (deploy) the SF:
>> >> $ devlink slice set pci/0000:06:00.0/100 state active
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>> >> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
>> >> 
>> >> Upon the activation, the device driver asks the device to instantiate
>> >> the actual SF device on particular PF. Does not matter if that is
>> >> on the same host or not.
>> >> 
>> >> On the other side, the PF driver instance gets the event from the device
>> >> that the particular SF was activated. It's the cue to put the device on
>> >> the bus, probe it, and instantiate netdev and devlink instances for it.  
>> >
>> >Seems backwards. It's the PF that wants the new function, why can't it
>> >just create it and either get an error from the other side or never get
>> >link up?  
>> 
>> We discussed that many times internally. I think it makes sense that
>> the SF is created by the same entity that manages the related eswitch
>> SF-representor. In other words, the "devlink slice" and related "devlink
>> port" object are under the same devlink instance.
>> 
>> If the PF in the host manages a nested eswitch, it can create the SF
>> inside and manage the nested eswitch SF-representor as you describe.
>> 
>> It is a matter of "nested eswitch manager on/off" configuration.
>> 
>> I think this is clean model and it is known who has what
>> responsibilities.
>
>I see so you want the creation to be controlled by the same entity that
>controls the eswitch..
>
>To me the creation should be on the side that actually needs/will use
>the new port. And if it's not eswitch manager then eswitch manager
>needs to ack it.

Hmm. The question is, is it worth complicating things in this way?
I don't know. I see a lot of potential for misunderstandings :/


>
>The precedence is probably not a strong argument, but that'd be the
>same way VFs work.. I don't think you can change how VFs work, right?

You can't, but since the VF model turned out to be not optimal, do we
want to stick with it for the SFs too? And perhaps the "future VF
implementation" can be done in a way that fits - that is described in
the "VF manual creation and activation user cmdline API draft" section
below.

You see that there are "dummies" created by default. I think that Jason
argued that these dummies should be created for max_vfs. That way the
eswitch manager can configure all possible VF representors, no matter
when the host decides to spawn them.


>
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||   VF manual creation and activation user cmdline API draft   ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> To enter manual mode, the user has to turn off VF dummies creation:
>> >> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>> >> $ devlink dev show
>> >> pci/0000:06:00.0: vf_dummies disabled
>> >> 
>> >> It is "enabled" by default in order not to break existing users.
>> >> 
>> >> By setting the "vf_dummies" attribute to "disabled", the driver
>> >> removes all dummy VFs. Only physical ports are present:
>> >> 
>> >> $ devlink port show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> >> 
>> >> Then the user is able to create them in a similar way as SFs:
>> >> 
>> >> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
>> >> 
>> >> The devlink kernel code calls down to device driver (devlink op) and asks
>> >> it to create a slice with particular attributes. Driver then instantiates
>> >> the slice and port:
>> >> 
>> >> $ devlink port show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf0 slice 99
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
>> >> 
>> >> Now we activate (deploy) the VF:
>> >> $ devlink slice set pci/0000:06:00.0/99 state active
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active
>> >> 
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||                             PFs                              ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> There are 2 flavours of PFs:
>> >> 1) Parent PF. That is coupled with uplink port. The slice flavour is
>> >>    therefore "physical", to be in sync with the flavour of the uplink port.
>> >>    In case this Parent PF is actually a leg of upstream embedded switch,
>> >>    the slice flavour is "virtual" (same as the port flavour).
>> >> 
>> >>    $ devlink port show
>> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
>> >> 
>> >>    $ devlink slice show
>> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> 
>> >>    This slice is shown in both "switchdev" and "legacy" modes.
>> >> 
>> >>    If there is another parent PF, say "0000:06:00.1", that shares the
>> >>    same embedded switch, aliasing is established for the devlink handles.
>> >> 
>> >>    The user can use devlink handles:
>> >>    pci/0000:06:00.0
>> >>    pci/0000:06:00.1
>> >>    as equivalents, pointing to the same devlink instance.
>> >> 
>> >>    Parent PFs are the ones that may be in control of managing
>> >>    embedded switch, on any hierarchy level.
>> >> 
>> >> 2) Child PF. This is a leg of a PF put to the parent PF. It is
>> >>    represented by a slice, and a port (with a netdevice):
>> >> 
>> >>    $ devlink port show
>> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
>> >>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
>> >> 
>> >>    $ devlink slice show
>> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >>    pci/0000:06:00.0/20: flavour pcipf pfnum 1 port 1 hw_addr aa:bb:cc:aa:bb:87 state active  <<<<<<<<<<
>> >> 
>> >>    This is a typical smartnic scenario. You would see this list on
>> >>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
>> >>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>> >>    go to the host.
>> >> 
>> >>    Note that inside the host, the PF is represented again as "Parent PF"
>> >>    and may be used to configure nested embedded switch.  
>> >
>> >This parent/child PF I don't understand. Does it stem from some HW
>> >limitations you have?  
>> 
>> No limitation. It's just a name for 2 roles. I didn't know how else to
>> name them for documentation purposes. Perhaps you can help me.
>> 
>> The child can simply manage a "nested eswitch". The "parent eswitch"
>> would see one leg (pf representor) one way or another. Only in case the
>> "nested eswitch" is there, the child would manage it - have separate
>> representors for vfs/sfs under its devlink instance.
>
>I see! I wouldn't use the term PF. I think we need a notion of 
>a "virtual" port within the NIC to model the eswitch being managed 
>by the Host.

But it is a PF (or maybe a VF). The fact that it may or may not have a
nested eswitch inside does not change the view of the parent eswitch
manager instance. Why would it?


>
>If Host manages the Eswitch - SmartNIC will no longer deal with its
>PCIe ports, but only with its virtual uplink.

What do you mean by "PCIe ports"?


>
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||               Slice operational state extension              ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> In addition to the "state" attribute that serves for the purpose
>> >> of setting the "admin state", there is "opstate" attribute added to
>> >> reflect the operational state of the slice:
>> >> 
>> >> 
>> >>     opstate                description
>> >>     -----------            ------------
>> >>     1. attached    State when slice device is bound to the host
>> >>                    driver. When the slice device is unbound from the
>> >>                    host driver, slice device exits this state and
>> >>                    enters detaching state.
>> >> 
>> >>     2. detaching   State when host is notified to deactivate the
>> >>                    slice device and slice device may be undergoing
>> >>                    detachment from host driver. When slice device is
>> >>                    fully detached from the host driver, slice exits
>> >>                    this state and enters detached state.
>> >> 
>> >>     3. detached    State when driver is fully unbound, it enters
>> >>                    into detached state.
>> >> 
>> >> slice state machine:
>> >> --------------------
>> >>                                slice state set inactive
>> >>                               ----<------------------<---
>> >>                               | or  slice delete        |
>> >>                               |                         |
>> >>   __________              ____|_______              ____|_______
>> >>  |          | slice add  |            |slice state |            |
>> >>  |          |-------->---|            |------>-----|            |
>> >>  | invalid  |            |  inactive  | set active |   active   |
>> >>  |          | slice del  |            |            |            |
>> >>  |__________|--<---------|____________|            |____________|
>> >> 
>> >> slice device operational state machine:
>> >> ---------------------------------------
>> >>   __________                ____________              ___________
>> >>  |          | slice state  |            |driver bus  |           |
>> >>  | invalid  |-------->-----|  detached  |------>-----| attached  |
>> >>  |          | set active   |            | probe()    |           |
>> >>  |__________|              |____________|            |___________|
>> >>                                  |                        |
>> >>                                  ^                    slice set
>> >>                                  |                    set inactive
>> >>                             successful detach             |
>> >>                               or pf reset                 |
>> >>                              ____|_______                 |
>> >>                             |            | driver bus     |
>> >>                  -----------| detaching  |---<-------------
>> >>                  |          |            | remove()
>> >>                  ^          |____________|
>> >>                  |   timeout      |
>> >>                  --<---------------
>> >> 
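As a side note, the two state machines above can be captured in a small
self-contained C sketch. This is purely illustrative - the enum and function
names are made up and this is not the proposed kernel code - but it shows how
the admin state and the operational state interact:

```c
#include <assert.h>

/* Admin state, driven by "devlink slice state set". */
enum slice_state {
	SLICE_STATE_INACTIVE,
	SLICE_STATE_ACTIVE,
};

/* Operational state, driven by host driver bind/unbind. */
enum slice_opstate {
	SLICE_OPSTATE_INVALID,
	SLICE_OPSTATE_DETACHED,
	SLICE_OPSTATE_ATTACHED,
	SLICE_OPSTATE_DETACHING,
};

struct slice {
	enum slice_state state;
	enum slice_opstate opstate;
};

/* "slice state set active": inactive -> active, opstate enters detached. */
static int slice_set_active(struct slice *s)
{
	if (s->state != SLICE_STATE_INACTIVE)
		return -1;
	s->state = SLICE_STATE_ACTIVE;
	s->opstate = SLICE_OPSTATE_DETACHED;
	return 0;
}

/* Host driver bus probe(): detached -> attached. */
static int slice_probe(struct slice *s)
{
	if (s->opstate != SLICE_OPSTATE_DETACHED)
		return -1;
	s->opstate = SLICE_OPSTATE_ATTACHED;
	return 0;
}

/* "slice state set inactive" (or slice delete): the host is notified and
 * an attached slice starts detaching from the host driver. */
static int slice_set_inactive(struct slice *s)
{
	if (s->state != SLICE_STATE_ACTIVE)
		return -1;
	s->state = SLICE_STATE_INACTIVE;
	if (s->opstate == SLICE_OPSTATE_ATTACHED)
		s->opstate = SLICE_OPSTATE_DETACHING;
	return 0;
}

/* Successful detach (or pf reset): detaching -> detached. */
static void slice_detach_done(struct slice *s)
{
	if (s->opstate == SLICE_OPSTATE_DETACHING)
		s->opstate = SLICE_OPSTATE_DETACHED;
}
```

The point of the split is that "state" is what the user asked for, while
"opstate" only follows once the host driver actually reacts (with the timeout
self-loop covering a host that never completes the detach).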
>> >> 
>> >> 
>> >> ==================================================================
>> >> ||                                                              ||
>> >> ||             Slice rate user cmdline API draft                ||
>> >> ||                                                              ||
>> >> ==================================================================
>> >> 
>> >> Note that some of the "devlink slice" attributes in show commands
>> >> are omitted on purpose.
>> >> 
>> >> 
>> >> $ devlink slice show
>> >> pci/0000:06:00.0/0: flavour physical pfnum
>> >> pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1
>> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0
>> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1
>> >> pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1
>> >> 
>> >> The slice object is extended with a new rate object.
>> >> 
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> 
>> >> This shows the leafs created by default alongside the slice objects. No min or
>> >> max tx rates were set, so their values are omitted.
>> >> 
>> >> 
>> >> Now create a new node rate object:
>> >> 
>> >> $ devlink slice rate add pci/0000:06:00.0/somerategroup
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node
>> >> 
>> >> New node rate object was created - the last line.
>> >> 
>> >> 
>> >> Now create another node rate object, this time with some attributes:
>> >> 
>> >> $ devlink slice rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node
>> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> Another new node object was created - the last line. The object has min and max
>> >> tx rates set, so they are displayed after the object type.
>> >> 
>> >> 
>> >> Now set the min/max rate of the node named somerategroup using the rate object:
>> >> 
>> >> $ devlink slice rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> 
>> >> Now set a leaf slice's rate using the rate object:
>> >> 
>> >> $ devlink slice rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> 
>> >> Now set the parent node of the leaf slice with index 2 using the rate object:
>> >> 
>> >> $ devlink slice rate set pci/0000:06:00.0/2 parent somerategroup
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> 
>> >> Now set the parent node of the leaf slice with index 1 using the rate object:
>> >> 
>> >> $ devlink slice rate set pci/0000:06:00.0/1 parent somerategroup
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf parent somerategroup
>> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> 
>> >> Now unset the parent node of the leaf slice with index 1 using the rate object:
>> >> 
>> >> $ devlink slice rate set pci/0000:06:00.0/1 noparent
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
>> >> pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000
>> >> 
>> >> 
>> >> Now delete the node object:
>> >> 
>> >> $ devlink slice rate del pci/0000:06:00.0/somerategroup
>> >> 
>> >> $ devlink slice rate
>> >> pci/0000:06:00.0/1: type leaf
>> >> pci/0000:06:00.0/2: type leaf
>> >> pci/0000:06:00.0/3: type leaf
>> >> pci/0000:06:00.0/4: type leaf
>> >> 
>> >> Rate node object was removed and its only child pci/0000:06:00.0/2 automatically
>> >> detached.  
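The node/leaf semantics of the walkthrough above can be modeled by a tiny
self-contained C sketch. Illustrative only - the struct and function names are
invented here, the real thing would be devlink core plus driver ops:

```c
#include <assert.h>
#include <stddef.h>

#define RATE_UNSET 0	/* unset min/max rates are omitted from show output */

enum rate_type { RATE_TYPE_LEAF, RATE_TYPE_NODE };

struct rate_obj {
	enum rate_type type;
	unsigned int min_tx_rate;	/* guaranteed tx rate */
	unsigned int max_tx_rate;	/* tx rate shaper */
	struct rate_obj *parent;	/* NULL means "noparent" */
};

/* "devlink slice rate add <dev>/<name> [min_tx_rate N] [max_tx_rate N]" */
static void rate_node_init(struct rate_obj *node, unsigned int min,
			   unsigned int max)
{
	node->type = RATE_TYPE_NODE;
	node->min_tx_rate = min;
	node->max_tx_rate = max;
	node->parent = NULL;
}

/* "devlink slice rate set <obj> min_tx_rate N max_tx_rate N" */
static void rate_set(struct rate_obj *obj, unsigned int min, unsigned int max)
{
	obj->min_tx_rate = min;
	obj->max_tx_rate = max;
}

/* "devlink slice rate set <leaf> parent <node>" / "... noparent" */
static int rate_set_parent(struct rate_obj *obj, struct rate_obj *parent)
{
	/* only node objects may be parents */
	if (parent && parent->type != RATE_TYPE_NODE)
		return -1;
	obj->parent = parent;
	return 0;
}

/* "devlink slice rate del <node>": children are detached automatically */
static void rate_node_del(struct rate_obj *node, struct rate_obj **leafs,
			  int n)
{
	for (int i = 0; i < n; i++)
		if (leafs[i]->parent == node)
			leafs[i]->parent = NULL;
}
```

So a leaf is created and destroyed together with its slice, a node only ever
exists because the user added it, and deleting a node just detaches its
children rather than failing.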
>> >
>> >Tomorrow we will support CoDel, ECN or any other queuing construct. 
>> >How many APIs do we want to have to configure the same thing? :/  
>> 
Right, I don't see another way. Please help me figure this out
differently. Note this is for configuring HW limits on the TX side of a VF.
>> 
>> We have only devlink_port and related netdev as representor of the VF
>> eswitch port. We cannot use this netdev to configure qdisc on RX.
>
>Ah, right. This is the TX case we abuse act_police for in OvS offload :S
>Yeah, we don't have an API for that.

Yeah :/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-21  9:35       ` Jiri Pirko
@ 2020-03-23 19:21         ` Jakub Kicinski
  2020-03-23 22:06           ` Jason Gunthorpe
                             ` (4 more replies)
  2020-03-23 23:32         ` Andy Gospodarek
  1 sibling, 5 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-23 19:21 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Sat, 21 Mar 2020 10:35:25 +0100 Jiri Pirko wrote:
> Fri, Mar 20, 2020 at 10:25:08PM CET, kuba@kernel.org wrote:
> >On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:  
> >> Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:  
> >> >On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:    
> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||            Overall illustration of example setup             ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
> >> >> Host B may be one of the hosts which gets a PF. Also, you might omit
> >> >> Host B and just treat Host A like an ordinary NIC in a host.    
> >> >
> >> >Could you enumerate the use scenarios for the SmartNIC?
> >> >
> >> >Is SmartNIC always "in-line", i.e. separating the host from the network?    
> >> 
> >> As far as I know, it is. The host is always given a PF which is a leg of
> >> eswitch managed on the SmartNIC.  
> >
> >Cool, I was hoping that's the case. One less configuration mode :)
> >  
> >> >Do we need a distinction based on whether the SmartNIC controls Host's
> >> >eswitch vs just the Host in its entirety (i.e. VF switching vs bare
> >> >metal)?    
> >> 
> >> I have this described in the "PFs" section. Basically we need to have a
> >> toggle to say "host is managing its own nested eswitch".
> >>   
> >> >I find it really useful to be able to list use cases, and constraints
> >> >first. Then go onto the design.
> >> >    
> >> >> Note that the PF is merged with physical port representor.
> >> >> That is due to a simpler and flawless transition from legacy mode and back.
> >> >> The devlink_ports and netdevs for physical ports are staying during
> >> >> the transition.    
> >> >
> >> >When users put an interface under bridge or a bond they have to move 
> >> >IP addresses etc. onto the bond. Changing mode to "switchdev" is a more
> >> >destructive operation and there should be no expectation that
> >> >configuration survives.    
> >> 
> >> Yeah, I was saying the same thing when our arch came up with this, but
> >> I now think it is just fine. It is the driver's responsibility to do the
> >> shift. And the entities representing the uplink port: netdevs and
> >> devlink_port instances. They can easily stay during the transition. The
> >> transition only applies to the eswitch and VF entities.  
> >
> >If PF is split from the uplink I think the MAC address should stay with
> >the PF, not the uplink (which becomes just a repr in a Host case).
> >  
> >> >The merging of the PF with the physical port representor is flawed.    
> >> 
> >> Why?  
> >
> >See below.
> >  
> >> >People push Qdisc offloads into devlink because of design shortcuts
> >> >like this.    
> >> 
> >> Could you please explain how it is related to "Qdisc offloads"  
> >
> >Certain users have designed with constrained PCIe bandwidth in the
> >server. Meaning NIC needs to do buffering much like a switch would.
> >So we need to separate the uplink from the PF to attach the Qdisc
> >offload for configuring details of PCIe queuing.  
> 
> Hmm, I'm not sure I understand. Then PF and uplink are the same entity,
> you can still attach the qdisc to this entity, right? What prevents the
> same functionality as with the "split"?

The same problem we have with the VF TX rate. We only have Qdisc APIs
for the TX direction. If we only have one port, its TX is the TX onto
the wire. If we split it into MAC/phys and PCIe, the TX of PCIe is the
RX of the host, allowing us to control queuing on the PCIe interface.

> >> >> |  | |  netdev enp6s0f0np1                           |             |    |
> >> >> |  | +->health reporters, params                     |             |    |
> >> >> |  | |                                               |             |    |
> >> >> |  | +->slice_pci/0000:06:00.0/0+--------------------+             |    |
> >> >> |  |       flavour physical                                        |    |
> >> >> |  |                                                               |    |
> >> >> |  +-+port_pci/0000:06:00.0/1+-----------------------+-------------|----+
> >> >> |  | |  flavour physical pfnum 1  (phys port and pf) |             |
> >> >> |  | |  netdev enp6s0f0np2                           |             |
> >> >> |  | +->health reporters, params                     |             |
> >> >> |  | |                                               |             |
> >> >> |  | +->slice_pci/0000:06:00.0/1+--------------------+             |
> >> >> |  |       flavour physical                                        |
> >> >> |  |                                                               |
> >> >> |  +-+-+port_pci/0000:06:00.0/2+-----------+-------------------+   |
> >> >> |  | | |  flavour pcipf pfnum 2            ^                   |   |
> >> >> |  | | |  netdev enp6s0f0pf2               |                   |   |
> >> >> |  | + +->params                           |                   |   |
> >> >> |  | |                                     |                   |   |
> >> >> |  | +->slice_pci/0000:06:00.0/2+----------+                   |   |
> >> >> |  |       flavour pcipf                                       |   |
> >> >> |  |                                                           |   |
> >> >> |  +-+-+port_pci/0000:06:00.0/3+-----------+----------------+  |   |
> >> >> |  | | |  flavour pcivf pfnum 2 vfnum 0    ^                |  |   |
> >> >> |  | | |  netdev enp6s0pf2vf0              |                |  |   |
> >> >> |  | | +->params                           |                |  |   |
> >> >> |  | |                                     |                |  |   |
> >> >> |  | +-+slice_pci/0000:06:00.0/3+----------+                |  |   |
> >> >> |  |   |   flavour pcivf                                    |  |   |
> >> >> |  |   +->rate (qos), mpgroup, mac                          |  |   |
> >> >> |  |                                                        |  |   |
> >> >> |  +-+-+port_pci/0000:06:00.0/4+-----------+-------------+  |  |   |
> >> >> |  | | |  flavour pcivf pfnum 0 vfnum 0    ^             |  |  |   |    
> >> >
> >> >So PF 0 is both on the SmartNIC where it is physical and on the Hosts?
> >> >Is this just error in the diagram?    
> >> 
> >> I think it is an error in the reading. This is the VF representation
> >> of VF pci/0000:06:00.1, on the same host A, which is the SmartNIC host.  
> >
> >Hm, I see pf 0 as the first port that has a line to uplink,
> >and here pf 0 vf 0 has a line to Host B.  
> 
> No. The line is to "Host A"

Ah damn, half of the boxes below are also Host A! Got it now.

> >The VF above is "pfnum 2 vfnum 0" which makes sense, PF 2 is 
> >Host B's PF. So VF 0 of PF 2 is also on Host B.
> >  
> >> >> |  | | |  netdev enp6s0pf0vf0              |             |  |  |   |
> >> >> |  | | +->params                           |             |  |  |   |
> >> >> |  | |                                     |             |  |  |   |
> >> >> |  | +-+slice_pci/0000:06:00.0/4+----------+             |  |  |   |
> >> >> |  |   |   flavour pcivf                                 |  |  |   |
> >> >> |  |   +->rate (qos), mpgroup, mac                       |  |  |   |
> >> >> |  |                                                     |  |  |   |
> >> >> |  +-+-+port_pci/0000:06:00.0/5+-----------+----------+  |  |  |   |
> >> >> |  | | |  flavour pcisf pfnum 0 sfnum 1    ^          |  |  |  |   |
> >> >> |  | | |  netdev enp6s0pf0sf1              |          |  |  |  |   |
> >> >> |  | | +->params                           |          |  |  |  |   |
> >> >> |  | |                                     |          |  |  |  |   |
> >> >> |  | +-+slice_pci/0000:06:00.0/5+----------+          |  |  |  |   |
> >> >> |  |   |   flavour pcisf                              |  |  |  |   |
> >> >> |  |   +->rate (qos), mpgroup, mac                    |  |  |  |   |
> >> >> |  |                                                  |  |  |  |   |
> >> >> |  +-+slice_pci/0000:06:00.0/6+--------------------+  |  |  |  |   |
> >> >> |        flavour pcivf pfnum 0 vfnum 1             |  |  |  |  |   |
> >> >> |            (non-ethernet (IB, NVE)               |  |  |  |  |   |
> >> >> |                                                  |  |  |  |  |   |
> >> >> +------------------------------------------------------------------+
> >> >>                                                    |  |  |  |  |
> >> >>                                                    |  |  |  |  |
> >> >>                                                    |  |  |  |  |
> >> >> +----------------------------------------------+   |  |  |  |  |
> >> >> |  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
> >> >> |                                              |   |  |  |  |  |
> >> >> |  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
> >> >> |  +->health reporters, info                   |   |  |  |  |  |
> >> >> |  |                                           |   |  |  |  |  |
> >> >> |  +-+port_pci/0000:10:00.0/1+---------------------------------+
> >> >> |    |  flavour virtual                        |   |  |  |  |
> >> >> |    |  netdev enp16s0f0                       |   |  |  |  |
> >> >> |    +->health reporters                       |   |  |  |  |
> >> >> |                                              |   |  |  |  |
> >> >> +----------------------------------------------+   |  |  |  |
> >> >>                                                    |  |  |  |
> >> >> +----------------------------------------------+   |  |  |  |
> >> >> |  devlink instance VF (other host)    HOST B  |   |  |  |  |
> >> >> |                                              |   |  |  |  |
> >> >> |  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
> >> >> |  +->health reporters, info                   |   |  |  |  |
> >> >> |  |                                           |   |  |  |  |
> >> >> |  +-+port_pci/0000:10:00.1/1+------------------------------+
> >> >> |    |  flavour virtual                        |   |  |  |
> >> >> |    |  netdev enp16s0f0v0                     |   |  |  |
> >> >> |    +->health reporters                       |   |  |  |
> >> >> |                                              |   |  |  |
> >> >> +----------------------------------------------+   |  |  |
> >> >>                                                    |  |  |
> >> >> +----------------------------------------------+   |  |  |
> >> >> |  devlink instance VF                 HOST A  |   |  |  |
> >> >> |                                              |   |  |  |
> >> >> |  pci/0000:06:00.1  (devlink dev)             |   |  |  |
> >> >> |  +->health reporters, info                   |   |  |  |
> >> >> |  |                                           |   |  |  |
> >> >> |  +-+port_pci/0000:06:00.1/1+---------------------------+
> >> >> |    |  flavour virtual                        |   |  |
> >> >> |    |  netdev enp6s0f0v0                      |   |  |
> >> >> |    +->health reporters                       |   |  |
> >> >> |                                              |   |  |
> >> >> +----------------------------------------------+   |  |
> >> >>                                                    |  |
> >> >> +----------------------------------------------+   |  |
> >> >> |  devlink instance SF                 HOST A  |   |  |
> >> >> |                                              |   |  |
> >> >> |  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
> >> >> |  +->health reporters, info                   |   |  |
> >> >> |  |                                           |   |  |
> >> >> |  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
> >> >> |    |  flavour virtual                        |   |
> >> >> |    |  netdev enp6s0f0s1                      |   |
> >> >> |    +->health reporters                       |   |
> >> >> |                                              |   |
> >> >> +----------------------------------------------+   |
> >> >>                                                    |
> >> >> +----------------------------------------------+   |
> >> >> |  devlink instance VF                 HOST A  |   |
> >> >> |                                              |   |
> >> >> |  pci/0000:06:00.2  (devlink dev)+----------------+
> >> >> |  +->health reporters, info                   |
> >> >> |                                              |
> >> >> +----------------------------------------------+
> >> >> 
> >> >> 
> >> >> 
> >> >> 
> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||                 what needs to be implemented                 ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> 1) physical port "pfnum". When PF and physical port representor
> >> >>    are merged, the instance of devlink_port representing the physical port
> >> >>    and PF needs to have "pfnum" attribute to be in sync
> >> >>    with other PF port representors.    
> >> >
> >> >See above.
> >> >    
> >> >> 2) per-port health reporters are not implemented yet.    
> >> >
> >> >Which health reports are visible on a SmartNIC port?     
> >> 
> >> I think there is a usecase for SmartNIC uplink/pf port health reporters.
> >> Those are the ones which we have for TX/RX queues on devlink instance
> >> now (that was a mistake). 
> >>   
> >> >
> >> >The Host ones or the SmartNIC ones?    
> >> 
> >> In the host, I think there is a usecase for VF/SF devlink_port health
> >> reporters - also for TX/RX queues.
> >>   
> >> >I think queue reporters should be per-queue, see below.    
> >> 
> >> Depends. There are reporters, like "fw" that are per-asic.
> >> 
> >>   
> >> >    
> >> >> 3) devlink_port instance in PF/VF/SF flavour "virtual". In PF/VF/SF devlink
> >> >>    instance (in a VM for example), it would make sense to have a devlink_port
> >> >>    instance, at least to carry a link to the netdevice name (otherwise we have
> >> >>    no easy way to find out which devlink instance and netdevice belong together).
> >> >>    I was thinking about flavour name, we have to distinguish from eswitch
> >> >>    devlink port flavours "pcipf, pcivf, pcisf".    
> >> >
> >> >Virtual is the flavor for the VF port, IIUC, so what's left to name?
> >> >Do you mean pick a phys_port_name format?    
> >> 
> >> No. "virtual" is devlink_port flavour in the host in the VF devlink  
> >
> >Yeah, I'm not 100% sure what you're describing as missing.
> >
> >Perhaps you could rephrase this point?  
> 
> If you look at the example above. See:
> pci/0000:10:00.0
> That is a devlink instance in host B. It has a port:
> pci/0000:10:00.0/1
> with flavour "virtual"
> that tells the user that this is a virtual link to the parent eswitch.
> 
> 
> >  
> >> >>    This was recently implemented by Parav:
> >> >> commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
> >> >> Merge: c04d102ba56e 162add8cbae4
> >> >> Author: David S. Miller <davem@davemloft.net>
> >> >> Date:   Tue Mar 3 15:40:40 2020 -0800
> >> >> 
> >> >>     Merge branch 'devlink-virtual-port'
> >> >> 
> >> >>    What is missing is the "virtual" flavour for nested PF.
> >> >> 
> >> >> 4) slice is not implemented yet. This is the original "vdev/subdev" concept.
> >> >>    See below section "Slice user cmdline API draft".
> >> >> 
> >> >> 5) SFs are not implemented.
> >> >>    See below section "SF (subfunction) user cmdline API draft".
> >> >> 
> >> >> 6) rate for slice are not implemented yet.
> >> >>    See below section "Slice rate user cmdline API draft".
> >> >> 
> >> >> 7) mpgroup for slice is not implemented yet.
> >> >>    See below section "Slice mpgroup user cmdline API draft".
> >> >> 
> >> >> 8) VF manual creation using devlink is not implemented yet.
> >> >>    See below section "VF manual creation and activation user cmdline API draft".
> >> >>  
> >> >> 9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
> >> >>    merged e-switch.
> >> >> 
> >> >> 
> >> >> 
> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||                  Issues/open questions                       ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> 1) "pfnum" has to be per-asic(/devlink instance), not per-host.
> >> >>    That means that in smartNIC scenario, we cannot have "pfnum 0"
> >> >>    for smartNIC and "pfnum 0" for host as well.    
> >> >
> >> >Right, exactly, NFP already does that.
> >> >    
> >> >> 2) Q: for TX, RX queues reporters, should it be bound to devlink_port?
> >> >>    For which flavours this might make sense?
> >> >>    Most probably for flavours "physical"/"virtual".
> >> >>    How about the reporters in VF/SF?    
> >> >
> >> >I think with the work Magnus is doing we should have queues as first    
> >> 
> >> Can you point me to the "work"?  
> >
> >There was a presentation at LPC last year, and some API proposal
> >circulated off-list :( Let's CC Magnus.  
> 
> Ok.
> 
> 
> >  
> >> >class citizens to be able to allocate them to ports.
> >> >
> >> >Would this mean we can hang reporters off queues?    
> >> 
> >> Yes, If we have a "queue object", the per-queue reporter would make sense.
> >> 
> >>   
> >> >    
> >> >> 3) How the management of nested switch is handled. The PFs created dynamically
> >> >>    or the ones in hosts in smartnic scenario may themselves be each a manager
> >> >>    of nested e-switch. How to toggle this capability?
> >> >>    During creation by a cmdline option?
> >> >>    During lifetime in case the PF does not have any childs (VFs/SFs)?    
> >> >
> >> >Maybe the grouping of functions into devlink instances would help? 
> >> >SmartNIC could control if the host can perform switching between
> >> >functions by either putting them in the same Host side devlink 
> >> >instance or not.    
> >> 
> >> I'm not sure I follow. There is a number of PFs created "on probe".
> >> Those are fixed and driver knows where to put them.
> >> The comment was about possible "dynamic PF" created by user in the same
> >> way he creates an SF, by devlink cmdline.
> >
> >How does the driver differentiate between a dynamic and static PF, 
> >and why are they different in the first place? :S  
> 
> They are not different. The driver just has to know to instantiate the
> static ones on probe, so in the SmartNIC case each host has a PF.
> 
> 
> >
> >Also, once the PFs are created user may want to use them together 
> >or delegate to a VM/namespace. So when I was thinking we'd need some 
> >sort of a secure handshake between PFs and FW for the host to prove 
> >to FW that the PFs belong to the same domain of control, and their
> >resources (and eswitches) can be pooled.
> >
> >I'm digressing..  
> 
> Yeah. This needs to be sorted out.
> 
> 
> >  
> >> Now the PF itself can have a "nested eswitch" to manage. The "parent
> >> eswitch" where the PF was created would only see one leg to the "nested
> >> eswitch".
> >> 
> >> This "nested eswitch management" might or might not be required. Depends
> >> on a usecase. The question was how to configure whether I as a user
> >> want this or not.  
> >
> >Ack. I'm extending your question. I think the question is not only who
> >controls the eswitch but also which PFs share the eswitch.  
> 
> Yes.
> 
> >
> >I think eswitch is just one capability, but SmartNIC will want to
> >control which ports see what capabilities in general. crypto offloads
> >and such.
> >
> >I presume in your model if host controls eswitch the smartNIC sees just  
> 
> host may control the "nested eswitch" in the SmartNIC case.

By nested eswitch you mean eswitch between ports of the same Host, or
within one PF? Then SmartNIC may switch between PFs or multiple hosts?

> >what comes out of the Host's single "uplink"? What if SmartNIC wants
> >the host to be able to control the forwarding but not lose the ability
> >to tap the VF to VF traffic?  
> 
> You mean that the VF representors would be in both SmartNIC host and
> host? I don't know how that could work. I think it has to be either
> in one place or the other.

That'd certainly be easier. Without representors we can't even check
traffic stats, though. SmartNIC may want to see the resource
utilization of ports even if it doesn't see the ports. That's just a
theoretical divagation, I don't think it's required.

> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||                Slice user cmdline API draft                  ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> Note that some of the "devlink port" attributes may be forgotten or misordered.
> >> >> 
> >> >> Slices and ports are created together by device driver. The driver defines
> >> >> the relationships during creation.
> >> >> 
> >> >> 
> >> >> $ devlink port show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> >> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
> >> >> 
> >> >> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
> >> >> indicate the slice-port pairs.
> >> >> 
> >> >> Also, there is a fixed "state" attribute with value "active". This is by
> >> >> default as the VFs are always created active. In future, it is planned
> >> >> to implement manual VF creation and activation, similar to what is below
> >> >> described for SFs.
> >> >> 
> >> >> Note that the non-ethernet slice (the last one) does not have any
> >> >> related port. It can be for example NVE or IB. But since
> >> >> the "hw_addr" attribute is also omitted, it isn't IB.
> >> >> 
> >> >>  
> >> >> Now set a different MAC address for VF1 on PF0:
> >> >> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> >> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2    
> >> >
> >> >What are slices?    
> >> 
> >> A slice is basically a piece of the ASIC: a PF/VF/SF. Slices serve for
> >> configuration of the "other side of the wire", like the MAC. A
> >> hypervisor admin can use the slice to set the MAC address of a VF
> >> which is in the virtual machine. Basically this should be a
> >> replacement of the "ip vf" command.
> >
> >I lost my mail archive but didn't we already have a long thread with
> >Parav about this?  
> 
> I believe so.

Oh, well. I still don't see the need for it :( If it's one to one with
ports why add another API, and have to do some cross linking to get
from one to the other?

I'd much rather resources hanging off the port.

> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||          SF (subfunction) user cmdline API draft             ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> Note that some of the "devlink port" attributes may be forgotten or misordered.
> >> >> 
> >> >> Note that some of the "devlink slice" attributes in show commands
> >> >> are omitted on purpose.
> >> >> 
> >> >> $ devlink port show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> >> >> 
> >> >> There is one VF on the NIC.
> >> >> 
> >> >> 
> >> >> Now create a subfunction on PF1. The index of the slice is going to be 100
> >> >> and the hw_addr aa:bb:cc:aa:bb:cc.
> >> >> 
> >> >> $ devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc
> >> >
> >> >Why is the SF number specified by the user rather than allocated?    
> >> 
> >> Because it is shown in the representor netdevice name, and you need to
> >> have it predetermined: enp6s0pf1sf10
> >
> >I'd think you need to know what was assigned, not necessarily pick
> >upfront.. I feel like we had this conversation before as well.  
> 
> Yeah. For scripting's sake, whenever you create something, you can
> directly use it later in the script. Like when you create a bridge: you
> assign it a name so you can use it.
> 
> The "what was assigned" approach would mean that the assigned
> value has to be somehow returned from the kernel and passed to the
> script. Not sure how. Do you have any example where this is happening in
> networking?

Not really, but when allocating objects it seems idiomatic to get the
id / handle / address of the new entity in response. Seems to me we're
not doing it because the infrastructure for it is not in place, but
it'd be a good extension.

Times are a little crazy but I can take a poke at implementing
something along those lines once I find some time..
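For illustration, the two allocation styles being debated here (caller-chosen index, as in the RFC draft, versus a kernel-assigned index returned to the caller) can be sketched roughly as follows; the class and method names are hypothetical and not part of any real devlink API:

```python
# Hypothetical sketch of the two slice-index allocation styles discussed
# above. Not a real devlink API; it only models the contrast.

class SliceTable:
    def __init__(self):
        self._slices = {}

    def add_user_indexed(self, index, **attrs):
        # RFC-draft style: the user picks the index up front, so scripts
        # can refer to the slice later without parsing a kernel response.
        if index in self._slices:
            raise ValueError("slice %d already exists" % index)
        self._slices[index] = dict(attrs, state="inactive")
        return index

    def add_auto_indexed(self, **attrs):
        # Jakub's suggestion: the kernel assigns the next free index and
        # returns it, the idiomatic allocate-and-return pattern.
        index = max(self._slices, default=-1) + 1
        self._slices[index] = dict(attrs, state="inactive")
        return index
```

Either style leaves the table in the same state; the difference is only whether the handle flows from user to kernel or from kernel back to user.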

> I think it is very clear to pass a handle to the command so you can
> later on use this handle to manipulate the entity.
> 
> + I don't see the benefit of the auto-allocation. Do you?
>
> >> >> The devlink kernel code calls down to the device driver (devlink op) and asks
> >> >> it to create a slice with particular attributes. The driver then instantiates
> >> >> the slice and port in the same way it is done for a VF:
> >> >> 
> >> >> $ devlink port show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
> >> >> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> >> >> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
> >> >> 
> >> >> Note that the SF slice is created but not active. That means the
> >> >> entities are created on the devlink side and the e-switch port representor
> >> >> is created, but the SF device itself is not yet out there (same host
> >> >> or different, depending on where the parent PF is - in this case the same host).
> >> >> User might use e-switch port representor enp6s0pf1sf10 to do settings,
> >> >> putting it into bridge, adding TC rules, etc.
> >> >> It's like the cable is unplugged on the other side.    
> >> >
> >> >If it's just "cable unplugged" can't we just use the fact the
> >> >representor is down to indicate no traffic can flow?    
> >> 
> >> It is not "cable unplugged". This "state inactive/active" is an admin
> >> state. You as an eswitch admin say: "I'm done with configuring a slice
> >> (MAC) and a representor (bridge, TC, etc) for this particular SF, and
> >> I want the HOST to instantiate the SF instance (with the configured
> >> MAC)."
> >
> >I'm not opposed, I just don't understand the need. If the ASIC will not
> >RX or TX any traffic from/to this new entity until the repr is brought up,
> >there should be no problem.  
> 
> The only need I see is that the eswitch manager can configure things for
> the slice and representor until the "activation kick" makes the device
> appear on the host. What you suggest is to do the "activation kick" by
> representor netdevice admin_up. There is a connection from slice to
> netdevice, but it is indirect:
> devlink_slice->devlink_port->netdev.
> Also, some slices may not have netdev (IB/NVE/etc).

Not sure what activation on IB and NVM looks like, but isn't simply not
serving requests until ready an option there?
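As a rough model of the two-phase flow Jiri describes (the eswitch admin configures the slice and its representor while the slice is inactive, then an explicit "activation kick" makes the device appear on the host), here is a minimal sketch; the class and its rules are illustrative assumptions, not driver code:

```python
# Illustrative model of the configure-then-activate slice lifecycle
# described above. Hypothetical; not real devlink or driver code.

class Slice:
    def __init__(self, flavour, hw_addr=None):
        self.flavour = flavour
        self.hw_addr = hw_addr
        # SFs are created inactive; activation is a separate admin step.
        self.state = "inactive"

    def set_hw_addr(self, hw_addr):
        # The eswitch admin may reconfigure freely before activation.
        if self.state == "active":
            raise RuntimeError("slice must be inactive to reconfigure")
        self.hw_addr = hw_addr

    def activate(self):
        # The "activation kick": the device instantiates the SF on the
        # host only once the admin says configuration is complete.
        if self.hw_addr is None:
            raise RuntimeError("configure hw_addr before activating")
        self.state = "active"
```

The point of the model: configuration and activation are distinct steps, so the host never sees a half-configured device.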

> >> >> Now we activate (deploy) the SF:
> >> >> $ devlink slice set pci/0000:06:00.0/100 state active
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
> >> >> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
> >> >> 
> >> >> Upon activation, the device driver asks the device to instantiate
> >> >> the actual SF device on a particular PF. It does not matter whether
> >> >> that is on the same host or not.
> >> >> 
> >> >> On the other side, the PF driver instance gets the event from device
> >> >> that particular SF was activated. It's the cue to put the device on bus
> >> >> probe it and instantiate netdev and devlink instances for it.    
> >> >
> >> >Seems backwards. It's the PF that wants the new function, why can't it
> >> >just create it and either get an error from the other side or never get
> >> >link up?    
> >> 
> >> We discussed that many times internally. I think it makes sense that
> >> the SF is created by the same entity that manages the related eswitch
> >> SF-representor. In other words, the "devlink slice" and related "devlink
> >> port" object are under the same devlink instance.
> >> 
> >> If the PF in the host manages a nested eswitch, it can create the SF inside
> >> and manage the nested eswitch SF-representor as you describe.
> >> 
> >> It is a matter of "nested eswitch manager on/off" configuration.
> >> 
> >> I think this is a clean model and it is clear who has which
> >> responsibilities.  
> >
> >I see so you want the creation to be controlled by the same entity that
> >controls the eswitch..
> >
> >To me the creation should be on the side that actually needs/will use
> >the new port. And if it's not eswitch manager then eswitch manager
> >needs to ack it.  
> 
> Hmm. The question is, is it worth complicating things in this way?
> I don't know. I see a lot of potential for misunderstandings :/

I'd see requesting SFs over devlink/sysfs as simplification, if
anything.

> >The precedence is probably not a strong argument, but that'd be the
> >same way VFs work.. I don't think you can change how VFs work, right?  
> 
> You can't, but since the VF model is not optimal as it turned out, do we
> want to stick with it for the SFs too?

The VF model is not optimal in what sense? I thought the main reason
for SFs is that they are significantly more light weight on Si side.

> And perhaps the "future VF
> implementation" can be done in a way that fits - that is described in
> the "VF manual creation and activation user cmdline API draft" section
> below.
> 
> You see that there are "dummies" created by default. I think that Jason
> argued that these dummies should be created for max_vfs. That way the
> eswitch manager can configure all possible VF representors, no matter
> when the host decides to spawn them.
> 
> 
> >  
> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||   VF manual creation and activation user cmdline API draft   ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> To enter manual mode, the user has to turn off VF dummies creation:
> >> >> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
> >> >> $ devlink dev show
> >> >> pci/0000:06:00.0: vf_dummies disabled
> >> >> 
> >> >> It is "enabled" by default in order not to break existing users.
> >> >> 
> >> >> By setting the "vf_dummies" attribute to "disabled", the driver
> >> >> removes all dummy VFs. Only physical ports are present:
> >> >> 
> >> >> $ devlink port show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> >> 
> >> >> Then the user is able to create them in a similar way as SFs:
> >> >> 
> >> >> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
> >> >> 
> >> >> The devlink kernel code calls down to the device driver (devlink op) and asks
> >> >> it to create a slice with particular attributes. The driver then instantiates
> >> >> the slice and port:
> >> >> 
> >> >> $ devlink port show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf0 slice 99
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >> >> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
> >> >> 
> >> >> Now we activate (deploy) the VF:
> >> >> $ devlink slice set pci/0000:06:00.0/99 state active
> >> >> 
> >> >> $ devlink slice show
> >> >> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active
> >> >> 
> >> >> ==================================================================
> >> >> ||                                                              ||
> >> >> ||                             PFs                              ||
> >> >> ||                                                              ||
> >> >> ==================================================================
> >> >> 
> >> >> There are 2 flavours of PFs:
> >> >> 1) Parent PF. That is coupled with an uplink port. The slice flavour is
> >> >>    therefore "physical", in sync with the flavour of the uplink port.
> >> >>    In case this Parent PF is actually a leg of an upstream embedded switch,
> >> >>    the slice flavour is "virtual" (same as the port flavour).
> >> >> 
> >> >>    $ devlink port show
> >> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> >> >> 
> >> >>    $ devlink slice show
> >> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >> 
> >> >>    This slice is shown in both "switchdev" and "legacy" modes.
> >> >> 
> >> >>    If there is another parent PF, say "0000:06:00.1", that shares the
> >> >>    same embedded switch, aliasing is established for the devlink handles.
> >> >> 
> >> >>    The user can use devlink handles:
> >> >>    pci/0000:06:00.0
> >> >>    pci/0000:06:00.1
> >> >>    as equivalents, pointing to the same devlink instance.
> >> >> 
> >> >>    Parent PFs are the ones that may be in control of managing
> >> >>    embedded switch, on any hierarchy level.
> >> >> 
> >> >> 2) Child PF. This is a leg of a PF plugged into the parent PF's eswitch.
> >> >>    It is represented by a slice, and a port (with a netdevice):
> >> >> 
> >> >>    $ devlink port show
> >> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> >> >>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
> >> >> 
> >> >>    $ devlink slice show
> >> >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >> >>    pci/0000:06:00.0/20: flavour pcipf pfnum 1 port 1 hw_addr aa:bb:cc:aa:bb:87 state active  <<<<<<<<<<
> >> >> 
> >> >>    This is a typical smartnic scenario. You would see this list on
> >> >>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
> >> >>    one of the hosts. If you send packets to enp6s0f0pf2, they will
> >> >>    go to the host.
> >> >> 
> >> >>    Note that inside the host, the PF is represented again as "Parent PF"
> >> >>    and may be used to configure nested embedded switch.    
> >> >
> >> >This parent/child PF I don't understand. Does it stem from some HW
> >> >limitations you have?    
> >> 
> >> No limitation. It's just a name for 2 roles. I didn't know how else to
> >> name it for the documentation purposes. Perhaps you can help me.
> >> 
> >> The child can simply manage a "nested eswitch". The "parent eswitch"
> >> would see one leg (PF representor) one way or another. Only in case the
> >> "nested eswitch" is there would the child manage it - have separate
> >> representors for VFs/SFs under its devlink instance.  
> >
> >I see! I wouldn't use the term PF. I think we need a notion of 
> >a "virtual" port within the NIC to model the eswitch being managed 
> >by the Host.  
> 
> But it is a PF (or maybe VF). The fact that it may or may not have
> nested switch inside does not change the view of the parent eswitch
> manager instance. Why would it?

See the first comment on mixing PCIe ports with uplink :)

> >If Host manages the Eswitch - SmartNIC will no longer deal with its
> >PCIe ports, but only with its virtual uplink.  
> 
> What do you mean by "PCIe ports"?

PF/VF/SFs, basically PCIe side ingress/egress/queuing.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-21  9:07       ` Parav Pandit
@ 2020-03-23 19:31         ` Jakub Kicinski
  2020-03-23 22:50           ` Jason Gunthorpe
  2020-03-24  5:36           ` Parav Pandit
  0 siblings, 2 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-23 19:31 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On Sat, 21 Mar 2020 09:07:30 +0000 Parav Pandit wrote:
> > I see so you want the creation to be controlled by the same entity that
> > controls the eswitch..
> > 
> > To me the creation should be on the side that actually needs/will use
> > the new port. And if it's not eswitch manager then eswitch manager
> > needs to ack it.
> >  
> 
> There are a few reasons to create them on the eswitch manager system, as below.
> 
> 1. Creation and deletion on one system, and synchronizing that with the
> eswitch system, requires multiple back-and-forth calls between the two systems.
> 
> 2. When this happens, the system where it is created doesn't know when the
> right time is to provision it to a VM or to an application.
> udev/systemd/NetworkManager and other such software might already
> start initializing it and doing DHCP, but its switch side is not yet ready.

Networking software can deal with link down..

> So it is desired to make sure that once the device is fully
> ready/configured, it is activated.
> 
> 3. Additionally, it doesn't follow a mirrored sequence during deletion
> when created on the host.

Why so? Surely the host needs to request deletion, otherwise a container
given only an SF could be cut off?

> 4. The eswitch administrator simply doesn't have direct access to the
> system where this device is used. So it just cannot be created there by
> the eswitch administrator.

Right, that is the point. It's the host admin that wants the new
entity, so if possible it'd be better if they could just ask for it 
via devlink rather than some cloud API. Not that I'm completely opposed
to a cloud API - just seems unnecessary here.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-20 21:25     ` Jakub Kicinski
  2020-03-21  9:07       ` Parav Pandit
  2020-03-21  9:35       ` Jiri Pirko
@ 2020-03-23 21:32       ` Andy Gospodarek
  2 siblings, 0 replies; 50+ messages in thread
From: Andy Gospodarek @ 2020-03-23 21:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson


[Sorry to do this, but I'm going to reply to a few different parts in
different messages.  I'm a bit late to the party and there's lots that
has already been worked out, but I want to address some of those in
detail.  Thanks to Jiri et al at Mellanox for starting this detailed
thread and sharing with the community.]

On Fri, Mar 20, 2020 at 02:25:08PM -0700, Jakub Kicinski wrote:
> On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
> > Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:
> > >On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:  
> > >> 
> > >> ==================================================================
> > >> ||                                                              ||
> > >> ||                             PFs                              ||
> > >> ||                                                              ||
> > >> ==================================================================
> > >> 
> > >> There are 2 flavours of PFs:
> > >> 1) Parent PF. That is coupled with an uplink port. The slice flavour is
> > >>    therefore "physical", in sync with the flavour of the uplink port.
> > >>    In case this Parent PF is actually a leg of an upstream embedded switch,
> > >>    the slice flavour is "virtual" (same as the port flavour).
> > >> 
> > >>    $ devlink port show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> > >> 
> > >>    $ devlink slice show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> > >> 
> > >>    This slice is shown in both "switchdev" and "legacy" modes.
> > >> 
> > >>    If there is another parent PF, say "0000:06:00.1", that shares the
> > >>    same embedded switch, aliasing is established for the devlink handles.
> > >> 
> > >>    The user can use devlink handles:
> > >>    pci/0000:06:00.0
> > >>    pci/0000:06:00.1
> > >>    as equivalents, pointing to the same devlink instance.
> > >> 
> > >>    Parent PFs are the ones that may be in control of managing
> > >>    embedded switch, on any hierarchy level.
> > >> 
> > >> 2) Child PF. This is a leg of a PF plugged into the parent PF's eswitch.
> > >>    It is represented by a slice, and a port (with a netdevice):
> > >> 
> > >>    $ devlink port show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> > >>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
> > >> 
> > >>    $ devlink slice show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> > >>    pci/0000:06:00.0/20: flavour pcipf pfnum 1 port 1 hw_addr aa:bb:cc:aa:bb:87 state active  <<<<<<<<<<
> > >> 
> > >>    This is a typical smartnic scenario. You would see this list on
> > >>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
> > >>    one of the hosts. If you send packets to enp6s0f0pf2, they will
> > >>    go to the host.
> > >> 
> > >>    Note that inside the host, the PF is represented again as "Parent PF"
> > >>    and may be used to configure nested embedded switch.  
> > >
> > >This parent/child PF I don't understand. Does it stem from some HW
> > >limitations you have?  
> > 
> > No limitation. It's just a name for 2 roles. I didn't know how else to
> > name it for the documentation purposes. Perhaps you can help me.
> > 
> > The child can simply manage a "nested eswitch". The "parent eswitch"
> > would see one leg (PF representor) one way or another. Only in case the
> > "nested eswitch" is there would the child manage it - have separate
> > representors for VFs/SFs under its devlink instance.
> 
> I see! I wouldn't use the term PF. I think we need a notion of 
> a "virtual" port within the NIC to model the eswitch being managed 
> by the Host.
> 
> If Host manages the Eswitch - SmartNIC will no longer deal with its
> PCIe ports, but only with its virtual uplink.
> 

We have been referencing these as PFs for a while but without any
in-kernel way to differentiate between what you describe as a
parent/child relationship.  The terminology someone came up with was the
notion of referring to these as "PF Pairs" when all traffic on a
SmartNIC goes to a particular PF on a host.

This is partly because in the nominal case when our SmartNIC is booted
the eSwitch is configured so that traffic is passed to the proper PF
based on the destination MAC of the traffic.  Here is a dump of the
interfaces on the smartnic side and server side for a 2 port card:

[root@smartnic ~]# ip li sh | grep enp -A 1 
2: enP8p1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a0 brd ff:ff:ff:ff:ff:ff
3: enP8p1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a1 brd ff:ff:ff:ff:ff:ff
4: enP8p1s0f2np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a2 brd ff:ff:ff:ff:ff:ff
5: enP8p1s0f3np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a3 brd ff:ff:ff:ff:ff:ff

root@server:~# ip li sh | grep enp -A 1 
2: enp1s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a8 brd ff:ff:ff:ff:ff:ff
3: enp1s0f1d1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a9 brd ff:ff:ff:ff:ff:ff
4: enp1s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:aa brd ff:ff:ff:ff:ff:ff
5: enp1s0f3d1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:ab brd ff:ff:ff:ff:ff:ff

So while it might not make sense at first to have fewer physical
interfaces than PFs on the server or smartnic, this gives the flexibility
to have a PF on the server side that does have direct network connectivity,
so that traffic destined to MAC address 00:0a:f7:ac:cf:a8 will go
directly to enp1s0f0 when it comes off the wire or from other PFs on the
server or smartnic.

We can also essentially 'lock out' PFs from being able to access the
physical ports if we want.  When that is done, the parent/child
relationship would be as you describe, and we would match up

enP8p1s0f2np0 (smartnic) <---> enp1s0f0 (server)
and
enP8p1s0f3np1 (smartnic) <---> enp1s0f1d1 (server)

and delete enp1s0f2 and enp1s0f3d1 on the server.
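The resulting pairing can be captured as a simple lookup (interface names are taken from the outputs above; the table and helper are purely illustrative):

```python
# Illustrative pairing of SmartNIC-side PF netdevs to the host-side PFs
# they back, per the 'lock out' scenario described above.
SMARTNIC_TO_HOST = {
    "enP8p1s0f2np0": "enp1s0f0",
    "enP8p1s0f3np1": "enp1s0f1d1",
}

def host_leg(smartnic_netdev):
    # Return the paired host PF netdev, or None if this SmartNIC netdev
    # is not a leg to any host PF (e.g. the physical-port PFs).
    return SMARTNIC_TO_HOST.get(smartnic_netdev)
```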

In this case PF0 and PF1 (enP8p1s0f0np0 and enP8p1s0f1np1) on the
smartnic effectively become the physical ports as there would be no
other 'ports' on the eswitch that are in the same broadcast domain.

I'm sure it comes as no surprise to anyone, but we also have the idea
that VFs can be paired in similar ways to PFs.  Practically speaking,
however, there is not much of a reason to use VFs on the SmartNIC
without VMs on the SmartNIC unless you are using this same parent/child
relationship.  Are you proposing that this will also be an option?

One point we also find is that customers are generally not super
interested in having these changed in real time.  I do LOVE the idea of
being able to query this information via devlink, however, so let's keep
that rolling, and if people want them to be real PCI b/d/f I think that
should be allowed.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:21         ` Jakub Kicinski
@ 2020-03-23 22:06           ` Jason Gunthorpe
  2020-03-24  3:56             ` Jakub Kicinski
  2020-03-26 14:37           ` Jiri Pirko
                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2020-03-23 22:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, parav, yuvalav, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Mon, Mar 23, 2020 at 12:21:23PM -0700, Jakub Kicinski wrote:
> > >I see so you want the creation to be controlled by the same entity that
> > >controls the eswitch..
> > >
> > >To me the creation should be on the side that actually needs/will use
> > >the new port. And if it's not eswitch manager then eswitch manager
> > >needs to ack it.  
> > 
> > Hmm. The question is, is it worth to complicate things in this way?
> > I don't know. I see a lot of potential misunderstandings :/
> 
> I'd see requesting SFs over devlink/sysfs as simplification, if
> anything.

We looked at it for a while. Working out the communication such that the
'untrusted' side could request that a port be created with certain
parameters, and the 'secure eswitch' could learn those parameters to
authorize and wire it up, was super complicated and very hard to do
without races.

Since it is a security sensitive operation it seems like a much more
secure design to have the secure side do all the creation and present
the fully operational object to the insecure side.

To draw a parallel to qemu & kvm, the untrusted guest VM can't request
that qemu create a virtio-net for it. Those are always hot plugged in
by the secure side. Same flow here.

> > >The precedence is probably not a strong argument, but that'd be the
> > >same way VFs work.. I don't think you can change how VFs work, right?  
> > 
> > You can't, but since the VF model is not optimal as it turned out, do we
> > want to stick with it for the SFs too?
> 
> The VF model is not optimal in what sense? I thought the main reason
> for SFs is that they are significantly more light weight on Si side.

Not entirely, really; the biggest reason for SFs is that VFs have to
be pre-allocated and their number is limited by PCI.

The VF model is poor because the VF is just a dummy stub until the
representor/eswitch side is fully configured. There is no way for the
Linux driver to know if the VF is operational or not, so we get weird
artifacts where we sometimes bind a driver to a VF (and get a
non-working ethXX) and sometimes we don't. 

The only reason it is like this is because of how SRIOV requires
everything to be preallocated.

The SFs can't even exist until they are configured, so there is no
state where a driver is connected to an inoperative SF.

So it would be nice if VFs and SFs had the same flow; the SF flow is
better, so let's fix the VF flow to match it.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:31         ` Jakub Kicinski
@ 2020-03-23 22:50           ` Jason Gunthorpe
  2020-03-24  3:41             ` Jakub Kicinski
  2020-03-24  5:36           ` Parav Pandit
  1 sibling, 1 reply; 50+ messages in thread
From: Jason Gunthorpe @ 2020-03-23 22:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Parav Pandit, Jiri Pirko, netdev, davem, Yuval Avnery,
	Saeed Mahameed, leon, andrew.gospodarek, michael.chan,
	Moshe Shemesh, Aya Levin, Eran Ben Elisha, Vlad Buslov,
	Yevgeny Kliteynik, dchickles, sburla, fmanlunas, Tariq Toukan,
	oss-drivers, snelson, drivers, aelior, GR-everest-linux-l2,
	grygorii.strashko, mlxsw, Ido Schimmel, Mark Zhang,
	jacob.e.keller, Alex Vesker, linyunsheng, lihong.yang,
	vikas.gupta, magnus.karlsson

On Mon, Mar 23, 2020 at 12:31:16PM -0700, Jakub Kicinski wrote:

> Right, that is the point. It's the host admin that wants the new
> entity, so if possible it'd be better if they could just ask for it 
> via devlink rather than some cloud API. Not that I'm completely opposed
> to a cloud API - just seems unnecessary here.

The cloud API provides all the permissions checks and security
elements. It cannot be avoided.

If you try to do it as you say then it is weird. You have to use the
cloud API to authorize the VM to touch a certain network, then the VM
has to somehow take that network ID and use devlink to get a netdev
for it. And the cloud side has to protect against a hostile VM sending
garbage along this communication channel.

vs simply host plugging in the correct network fully operational when
the cloud API connects the VM to the network.

Jason

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-21  9:35       ` Jiri Pirko
  2020-03-23 19:21         ` Jakub Kicinski
@ 2020-03-23 23:32         ` Andy Gospodarek
  2020-03-24  0:11           ` Jason Gunthorpe
  2020-03-24  5:53           ` Parav Pandit
  1 sibling, 2 replies; 50+ messages in thread
From: Andy Gospodarek @ 2020-03-23 23:32 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Sat, Mar 21, 2020 at 10:35:25AM +0100, Jiri Pirko wrote:
> Fri, Mar 20, 2020 at 10:25:08PM CET, kuba@kernel.org wrote:
> >On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
> >> Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@kernel.org wrote:
> >> >On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:  
[...]
> >
> >Also, once the PFs are created user may want to use them together 
> >or delegate to a VM/namespace. So when I was thinking we'd need some 
> >sort of a secure handshake between PFs and FW for the host to prove 
> >to FW that the PFs belong to the same domain of control, and their
> >resources (and eswitches) can be pooled.
> >
> >I'm digressing..
> 
> Yeah. This needs to be sorted out.
> 
> 
> >
> >> Now the PF itself can have a "nested eswitch" to manage. The "parent
> >> eswitch" where the PF was created would only see one leg to the "nested
> >> eswitch".
> >> 
> >> This "nested eswitch management" might or might not be required. Depends
> >> on a usecase. The question was, how to configure that I as a user
> >> want this or not.
> >
> >Ack. I'm extending your question. I think the question is not only who
> >controls the eswitch but also which PFs share the eswitch.
> 
> Yes.
> 

So we have implemented the notion of an 'administrative PF.'  This is a
gross simplification, but the idea is that the PCI domain (or CPU
complex) that contains this PF is the one that is 'in-charge' of the
eSwitch and the rest of the resources (firmware/PHY code update) and
might also be the one that gets the VF representors when VFs are created
on any other PCI host/domains.

I'm not sure we need a kernel API to set it as I would leave this as
something that might be burned into the hardware in some manner.

> >
> >I think eswitch is just one capability, but SmartNIC will want to
> >control which ports see what capabilities in general. crypto offloads
> >and such.
> >
> >I presume in your model if host controls eswitch the smartNIC sees just
> 
> host may control the "nested eswitch" in the SmartNIC case.
> 

I'm not sure programming the eswitch in a nested manner is realistic.
Sure we can make hardware do it, but it's probably more trouble than
it's worth.  If a smartnic wants to give control of flows to the host
then it makes more sense to allow some communication at a higher layer
so that requests for hardware offload can be easily validated against
some sort of policy set forth by the admin of the smartnic.

> >what comes out of the Host's single "uplink"? What if SmartNIC wants
> >the host to be able to control the forwarding but not lose the ability
> >to tap the VF to VF traffic?
> 
> You mean that the VF representors would be in both SmartNIC host and
> host? I don't know how that could work. I think it has to be either
> there or there.
> 

Agreed.  The VF reps should probably appear on whichever host/domain has
the Admin PF.



* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 23:32         ` Andy Gospodarek
@ 2020-03-24  0:11           ` Jason Gunthorpe
  2020-03-24  5:53           ` Parav Pandit
  1 sibling, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2020-03-24  0:11 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jiri Pirko, Jakub Kicinski, netdev, davem, parav, yuvalav,
	saeedm, leon, andrew.gospodarek, michael.chan, moshe, ayal,
	eranbe, vladbu, kliteyn, dchickles, sburla, fmanlunas, tariqt,
	oss-drivers, snelson, drivers, aelior, GR-everest-linux-l2,
	grygorii.strashko, mlxsw, idosch, markz, jacob.e.keller, valex,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On Mon, Mar 23, 2020 at 07:32:00PM -0400, Andy Gospodarek wrote:

> If a smartnic wants to give control of flows to the host
> then it makes more sense to allow some communication at a higher layer
> so that requests for hardware offload can be easily validated against
> some sort of policy set forth by the admin of the smartnic.

The important rule is that a PF/VF/SF is always constrained by its
representor. It doesn't matter how it handles the packets internally,
via a tx/rx ring, via an eswitch offload, RDMA, iSCSI, etc.; it is all
the same as far as the representor is concerned.
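
As a toy illustration of that rule (all names here are invented, not any
real driver API): the representor acts as a single policy gate, and the
internal path that produced the packet is irrelevant to the check.

```python
def rep_allows(rep_policy, pkt):
    """Toy representor gate: whatever internal path produced the packet
    (tx ring, eswitch offload, RDMA, ...), it passes the same check."""
    return pkt["vlan"] in rep_policy["allowed_vlans"]


policy = {"allowed_vlans": {10, 20}}

# The function is constrained identically regardless of internal path.
for path in ("tx_ring", "eswitch_offload", "rdma"):
    assert rep_allows(policy, {"vlan": 10, "path": path})
assert not rep_allows(policy, {"vlan": 30, "path": "tx_ring"})
```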

Since eswitch is a powerful offload capability it makes sense to
directly nest it.

Jason



* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 22:50           ` Jason Gunthorpe
@ 2020-03-24  3:41             ` Jakub Kicinski
  2020-03-24 13:43               ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-24  3:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Jiri Pirko, netdev, davem, Yuval Avnery,
	Saeed Mahameed, leon, andrew.gospodarek, michael.chan,
	Moshe Shemesh, Aya Levin, Eran Ben Elisha, Vlad Buslov,
	Yevgeny Kliteynik, dchickles, sburla, fmanlunas, Tariq Toukan,
	oss-drivers, snelson, drivers, aelior, GR-everest-linux-l2,
	grygorii.strashko, mlxsw, Ido Schimmel, Mark Zhang,
	jacob.e.keller, Alex Vesker, linyunsheng, lihong.yang,
	vikas.gupta, magnus.karlsson

On Mon, 23 Mar 2020 19:50:09 -0300 Jason Gunthorpe wrote:
> On Mon, Mar 23, 2020 at 12:31:16PM -0700, Jakub Kicinski wrote:
> 
> > Right, that is the point. It's the host admin that wants the new
> > entity, so if possible it'd be better if they could just ask for it 
> > via devlink rather than some cloud API. Not that I'm completely opposed
> > to a cloud API - just seems unnecessary here.  
> 
> The cloud API provides all the permissions checks and security
> elements. It cannot be avoided.

Ack, the question is just who consults the cloud API, the Host or the
SmartNIC (latter would abstract differences between cloud APIs).

> If you try to do it as you say then it is weird. You have to use the
> cloud API to authorize the VM to touch a certain network, then the VM
> has to somehow take that network ID and use devlink to get a netdev
> for it. And the cloud side has to protect against a hostile VM sending
> garbage along this communication channel.

I don't understand how the VM needs to know the network ID, quite the
opposite, the Network ID should be gettable/settable by the
hypervisor/PF.

If VF starts requesting nested network IDs those should be in a
separate namespace from the "outer" ones, no?

> vs simply host plugging in the correct network fully operational when
> the cloud API connects the VM to the network.

That means the user has to pre-allocate the device ID, or query the
cloud API after the device is created about its attributes (in the case
of two interfaces being requested simultaneously).

I don't feel very strongly about this, but given how many Linux
instances run in the cloud it'd seem nice if we had some APIs to meet
their basic needs.


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 22:06           ` Jason Gunthorpe
@ 2020-03-24  3:56             ` Jakub Kicinski
  2020-03-24 13:20               ` Jason Gunthorpe
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-24  3:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jiri Pirko, netdev, davem, parav, yuvalav, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Mon, 23 Mar 2020 19:06:05 -0300 Jason Gunthorpe wrote:
> On Mon, Mar 23, 2020 at 12:21:23PM -0700, Jakub Kicinski wrote:
> > > >I see so you want the creation to be controlled by the same entity that
> > > >controls the eswitch..
> > > >
> > > >To me the creation should be on the side that actually needs/will use
> > > >the new port. And if it's not eswitch manager then eswitch manager
> > > >needs to ack it.    
> > > 
> > > Hmm. The question is, is it worth complicating things in this way?
> > > I don't know. I see a lot of potential misunderstandings :/  
> > 
> > I'd see requesting SFs over devlink/sysfs as simplification, if
> > anything.  
> 
> We looked at it for a while, working the communication such that the
> 'untrusted' side could request a port be created with certain
> parameters and the 'secure eswitch' could know those parameters to
> authorize and wire it up was super complicated and very hard to do
> without races.
> 
> Since it is a security sensitive operation it seems like a much more
> secure design to have the secure side do all the creation and present
> the fully operational object to the insecure side.
> 
> To draw a parallel to qemu & kvm, the untrusted guest VM can't request
> that qemu create a virtio-net for it. Those are always hot plugged in
> by the secure side. Same flow here.

Could you tell us a little more about the races? Other than the
communication channel what changes between issuing from cloud API
vs devlink?

Side note - there is no communication channel between VM and hypervisor
right now, which is the cause for weird designs e.g. the failover/auto
bond mechanism.

> > > >The precedence is probably not a strong argument, but that'd be the
> > > >same way VFs work.. I don't think you can change how VFs work, right?    
> > > 
> > > You can't, but since the VF model is not optimal as it turned out, do we
> > > want to stick with it for the SFs too?  
> > 
> > The VF model is not optimal in what sense? I thought the main reason
> > for SFs is that they are significantly more light weight on Si side.  
> 
> Not entirely really, the biggest reason for SF is that VFs have to
> be pre-allocated and have a number limited by PCI.
> 
> The VF model is poor because the VF is just a dummy stub until the
> representor/eswitch side is fully configured. There is no way for the
> Linux driver to know if the VF is operational or not, so we get weird
> artifacts where we sometimes bind a driver to a VF (and get a
> non-working ethXX) and sometimes we don't. 

Sounds like an implementation issue :S

> The only reason it is like this is because of how SRIOV requires
> everything to be preallocated.

SF also requires pre-allocated resources, so you're not talking about
PCI mem space etc. here I assume.

> The SFs can't even exist until they are configured, so there is no
> state where a driver is connected to an inoperative SF.

You mean it doesn't exist in terms of sysfs device entry?

Okay, if we limit ourselves to sysfs as device interface then sure.
We have far richer interfaces in the networking world, so it's a little
hard to sympathize.

> So it would be nice if VF and SF had the same flow, the SF flow is
> better, lets fix the VF flow to match it.


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:31         ` Jakub Kicinski
  2020-03-23 22:50           ` Jason Gunthorpe
@ 2020-03-24  5:36           ` Parav Pandit
  1 sibling, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-03-24  5:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On 3/24/2020 1:01 AM, Jakub Kicinski wrote:
> On Sat, 21 Mar 2020 09:07:30 +0000 Parav Pandit wrote:
>>> I see so you want the creation to be controlled by the same entity that
>>> controls the eswitch..
>>>
>>> To me the creation should be on the side that actually needs/will use
>>> the new port. And if it's not eswitch manager then eswitch manager
>>> needs to ack it.
>>>  
>>
>> There are few reasons to create them on eswitch manager system as below.
>>
>> 1. Creation and deletion on one system and synchronizing it with eswitch
>> system requires multiple back-n-forth calls between two systems.
>>
>> 2. When this happens, the system where it's created doesn't know the
>> right time to provision to a VM or to an application.
>> udev/systemd/Network Manager and others such software might already
>> start initializing it doing DHCP but its switch side is not yet ready.
> 
> Networking software can deal with link down..
> 
Serving a half-cooked device to an application is just not going to
work. It is not just link status.
A typical desired flow is:

1. create device
2. configure mac address
3. configure its rate limits
4. setup policy, encap/decap settings via tc offloads etc
5. bring up the link via rep
6. activate the device and attach it to application

Often the administrator wants to assign/do (2), even though the user is
free to change it later on.

This can only work if the user system and the eswitch have a secure
channel established, which often is just not available.

In another use case, the user system is not trusted to define the
networkId. In that case there is some arbitrary/undefined wait time at
the user host to know that 2 to 6 are done.

vs. doing steps 1 to 6 on the eswitch side by a trusted entity and
attaching the device to the system where it will be used, which is
elegant and secure.

The networkID is read-only for the system where this is deployed.

An application/vm/container may have one or more such devices that need
to be present for its lifetime regardless of their link status.

link down => detach device from container/vm
link up => attach device to container/vm is not right.
Hence port link status doesn't drive device status.
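
The flow above, and the decoupling of link state from attachment, can be
sketched as a toy state machine (Python; the class, states, and methods
are all invented for illustration, not any real devlink interface):

```python
class SliceDevice:
    """Toy model: a device is only exposed to the user system once it is
    fully configured and activated, and attachment never follows the
    port's link state."""

    def __init__(self):
        self.state = "created"      # step 1: create device
        self.hw_addr = None
        self.link_up = False
        self.attached = False

    def configure(self, hw_addr):   # steps 2-4: mac, rate limits, policy
        self.hw_addr = hw_addr
        self.state = "configured"

    def set_link(self, up):         # step 5: link via the rep; note this
        self.link_up = up           # never touches self.attached

    def activate(self):             # step 6: only now attach to the app/VM
        if self.state != "configured":
            raise RuntimeError("refusing to expose a half-configured device")
        self.state = "active"
        self.attached = True


dev = SliceDevice()
dev.configure("10:22:33:44:55:66")
dev.set_link(True)
dev.activate()
dev.set_link(False)                 # link down: device stays attached
assert dev.attached and dev.state == "active"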

>> So it is desired to make sure that once the device is fully
>> ready/configured, it's activated.
>>
>> 3. Additionally it doesn't follow mirror sequence during deletion when
>> created on host.
> 
> Why so? Surely host needs to request deletion, otherwise container
> given only an SF could be cut off?
> 
creation from user system,
(a) create device
(b) configure device
(c) synchronous kick to create rep on other system (involving sw)

deletion from user system should be,
(d) synchronous kick to delete rep on other system (involving sw)
(e) unconfig the device
(f) delete the device

To achieve this mirror sequence, software synchronization is needed, not
just with the device.
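
The asymmetry can be seen in a toy sketch of the (a)-(f) sequence, with
the representor kick handled on the eswitch side (all helper names are
invented for illustration):

```python
class Eswitch:
    """Stands in for the eswitch-side software that owns the reps."""
    def __init__(self):
        self.reps = set()

    def create_rep(self, dev_id):   # (c) synchronous kick on creation
        self.reps.add(dev_id)

    def delete_rep(self, dev_id):   # (d) synchronous kick on deletion
        self.reps.discard(dev_id)


def create_device(esw, dev_id, cfg):
    dev = {"id": dev_id, "cfg": dict(cfg)}  # (a) create, (b) configure
    esw.create_rep(dev_id)                  # (c) rep appears on eswitch side
    return dev


def delete_device(esw, dev):
    esw.delete_rep(dev["id"])               # (d) mirror of (c) comes first
    dev["cfg"] = None                       # (e) unconfigure
    dev["id"] = None                        # (f) delete


esw = Eswitch()
dev = create_device(esw, 7, {"mac": "aa:bb:cc:dd:ee:ff"})
assert 7 in esw.reps
delete_device(esw, dev)
assert not esw.reps
```

The only point of the sketch is the ordering: the rep kicks (c)/(d)
bracket the device's configured lifetime, and that is the piece that is
hard to synchronize when driven from the untrusted side.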

Even if this is achieved somehow, it doesn't address the issue of an
untrusted user system not having the privilege to create the device with
a given NetworkID.

>> 4. eswitch administrator simply doesn't have direct access to the system
>> where this device is used. So it just cannot be created there by eswitch
>> administrator.
> 
> Right, that is the point. It's the host admin that wants the new
> entity, so if possible it'd be better if they could just ask for it 
> via devlink rather than some cloud API. Not that I'm completely opposed
> to a cloud API - just seems unnecessary here.
> 

Flow is:
trusted_administrator->cloud_api->smartnic->devlink_create,config,deploy->get_device_on_user_system.

untrusted_user_system->query_network_id->attach_to_container/vm/application.


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 23:32         ` Andy Gospodarek
  2020-03-24  0:11           ` Jason Gunthorpe
@ 2020-03-24  5:53           ` Parav Pandit
  1 sibling, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-03-24  5:53 UTC (permalink / raw)
  To: Andy Gospodarek, Jiri Pirko
  Cc: Jakub Kicinski, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On 3/24/2020 5:02 AM, Andy Gospodarek wrote:
> 
> Agreed.  The VF reps should probably appear on whichever host/domain has
> the Admin PF.
> 
+1



* Re: [RFC] current devlink extension plan for NICs
  2020-03-24  3:56             ` Jakub Kicinski
@ 2020-03-24 13:20               ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2020-03-24 13:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, parav, yuvalav, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Mon, Mar 23, 2020 at 08:56:19PM -0700, Jakub Kicinski wrote:
> On Mon, 23 Mar 2020 19:06:05 -0300 Jason Gunthorpe wrote:
> > On Mon, Mar 23, 2020 at 12:21:23PM -0700, Jakub Kicinski wrote:
> > > > >I see so you want the creation to be controlled by the same entity that
> > > > >controls the eswitch..
> > > > >
> > > > >To me the creation should be on the side that actually needs/will use
> > > > >the new port. And if it's not eswitch manager then eswitch manager
> > > > >needs to ack it.    
> > > > 
> > > > Hmm. The question is, is it worth complicating things in this way?
> > > > I don't know. I see a lot of potential misunderstandings :/  
> > > 
> > > I'd see requesting SFs over devlink/sysfs as simplification, if
> > > anything.  
> > 
> > We looked at it for a while, working the communication such that the
> > 'untrusted' side could request a port be created with certain
> > parameters and the 'secure eswitch' could know those parameters to
> > authorize and wire it up was super complicated and very hard to do
> > without races.
> > 
> > Since it is a security sensitive operation it seems like a much more
> > secure design to have the secure side do all the creation and present
> > the fully operational object to the insecure side.
> > 
> > To draw a parallel to qemu & kvm, the untrusted guest VM can't request
> > that qemu create a virtio-net for it. Those are always hot plugged in
> > by the secure side. Same flow here.
> 
> Could you tell us a little more about the races? Other than the
> communication channel what changes between issuing from cloud API
> vs devlink?

If I recall the problems came when trying to work with existing cloud
infrastructure that doesn't assume this operating model. You need to
somehow adapt an async API of secure/insecure communication to an
async API inside the cloud world. It was a huge mess.

> Side note - there is no communication channel between VM and hypervisor
> right now, which is the cause for weird designs e.g. the failover/auto
> bond mechanism.

Right, and considering the security concerns building one hidden
inside a driver seems like a poor idea..

> > The VF model is poor because the VF is just a dummy stub until the
> > representor/eswitch side is fully configured. There is no way for the
> > Linux driver to know if the VF is operational or not, so we get weird
> > artifacts where we sometimes bind a driver to a VF (and get a
> > non-working ethXX) and sometimes we don't. 
> 
> Sounds like an implementation issue :S

How so?

> > The only reason it is like this is because of how SRIOV requires
> > everything to be preallocated.
> 
> SF also requires pre-allocated resources, so you're not talking about
> PCI mem space etc. here I assume.

It isn't pre-allocated, the usage of the BAR space is dynamic.

> > The SFs can't even exist until they are configured, so there is no
> > state where a driver is connected to an inoperative SF.
> 
> You mean it doesn't exist in terms of sysfs device entry?

I mean literally do not exist at the HW level.

Jason


* Re: [RFC] current devlink extension plan for NICs
  2020-03-24  3:41             ` Jakub Kicinski
@ 2020-03-24 13:43               ` Jason Gunthorpe
  0 siblings, 0 replies; 50+ messages in thread
From: Jason Gunthorpe @ 2020-03-24 13:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Parav Pandit, Jiri Pirko, netdev, davem, Yuval Avnery,
	Saeed Mahameed, leon, andrew.gospodarek, michael.chan,
	Moshe Shemesh, Aya Levin, Eran Ben Elisha, Vlad Buslov,
	Yevgeny Kliteynik, dchickles, sburla, fmanlunas, Tariq Toukan,
	oss-drivers, snelson, drivers, aelior, GR-everest-linux-l2,
	grygorii.strashko, mlxsw, Ido Schimmel, Mark Zhang,
	jacob.e.keller, Alex Vesker, linyunsheng, lihong.yang,
	vikas.gupta, magnus.karlsson

On Mon, Mar 23, 2020 at 08:41:16PM -0700, Jakub Kicinski wrote:
> On Mon, 23 Mar 2020 19:50:09 -0300 Jason Gunthorpe wrote:
> > On Mon, Mar 23, 2020 at 12:31:16PM -0700, Jakub Kicinski wrote:
> > 
> > > Right, that is the point. It's the host admin that wants the new
> > > entity, so if possible it'd be better if they could just ask for it 
> > > via devlink rather than some cloud API. Not that I'm completely opposed
> > > to a cloud API - just seems unnecessary here.  
> > 
> > The cloud API provides all the permissions checks and security
> > elements. It cannot be avoided.
> 
> Ack, the question is just who consults the cloud API, the Host or the
> SmartNIC (latter would abstract differences between cloud APIs).

I feel it is more natural to use the native cloud API. It will always
have the right feature set and authorization scheme. If we try to make
a lowest common denominator in devlink then it will probably be a poor
match, and possibly very complicated.

We have to support push auto-creation anyhow to make the 'device
pre-exists at boot time' case work right.

> > If you try to do it as you say then it is weird. You have to use the
> > cloud API to authorize the VM to touch a certain network, then the VM
> > has to somehow take that network ID and use devlink to get a netdev
> > for it. And the cloud side has to protect against a hostile VM sending
> > garbage along this communication channel.
> 
> I don't understand how the VM needs to know the network ID, quite the
> opposite, the Network ID should be gettable/settable by the hypervisor/
> /PF.

I don't follow, if I have thousands of vlans in my cloud world I need
some way to label them all. If the VM is going to ask for a specific
vlan to be plugged into it, then it needs to ask with the right label.

> > vs simply host plugging in the correct network fully operational when
> > the cloud API connects the VM to the network.
> 
> That means the user has to pre-allocate the device ID, or query the
> cloud API after the device is created about its attributes (in the case
> of two interfaces being requested simultaneously).

I suppose it would use MAC address matching, the cloud creation action
will indicate the MAC and the netdev will appear with that MAC when it
is ready
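
A minimal sketch of that matching step, assuming the guest already
learned the MAC from the cloud creation call (the function and the
netdev table are illustrative; on Linux the table could be filled from
/sys/class/net/*/address):

```python
def find_netdev_by_mac(want_mac, netdevs):
    """Return the interface whose MAC matches, or None if the device is
    not (yet) plugged in. netdevs maps interface name -> MAC string."""
    want = want_mac.lower()
    for name, mac in netdevs.items():
        if mac.lower() == want:
            return name
    return None


netdevs = {"eth0": "52:54:00:12:34:56", "eth1": "10:22:33:44:55:66"}
assert find_netdev_by_mac("10:22:33:44:55:66", netdevs) == "eth1"
assert find_netdev_by_mac("de:ad:be:ef:00:00", netdevs) is None
```

In practice the guest would poll (or get a hotplug event) until the
lookup stops returning None.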

I think we could do a better job passing some kind of meta-information
into the guest..

Jason


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:21         ` Jakub Kicinski
  2020-03-23 22:06           ` Jason Gunthorpe
@ 2020-03-26 14:37           ` Jiri Pirko
  2020-03-26 14:43           ` Jiri Pirko
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 50+ messages in thread
From: Jiri Pirko @ 2020-03-26 14:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

Mon, Mar 23, 2020 at 08:21:23PM CET, kuba@kernel.org wrote:
>> >> >> Note that the PF is merged with physical port representor.
>> >> >> That is due to simpler and flawless transition from legacy mode and back.
>> >> >> The devlink_ports and netdevs for physical ports are staying during
>> >> >> the transition.    
>> >> >
>> >> >When users put an interface under bridge or a bond they have to move 
>> >> >IP addresses etc. onto the bond. Changing mode to "switchdev" is a more
>> >> >destructive operation and there should be no expectation that
>> >> >configuration survives.    
>> >> 
>> >> Yeah, I was saying the same thing when our arch came up with this, but
>> >> I now think it is just fine. It is drivers responsibility to do the
>> >> shift. And the entities representing the uplink port: netdevs and
>> >> devlink_port instances. They can easily stay during the transition. The
>> >> transition only applies to the eswitch and VF entities.  
>> >
>> >If PF is split from the uplink I think the MAC address should stay with
>> >the PF, not the uplink (which becomes just a repr in a Host case).
>> >  
>> >> >The merging of the PF with the physical port representor is flawed.    
>> >> 
>> >> Why?  
>> >
>> >See below.
>> >  
>> >> >People push Qdisc offloads into devlink because of design shortcuts
>> >> >like this.    
>> >> 
>> >> Could you please explain how it is related to "Qdisc offloads"  
>> >
>> >Certain users have designed with constrained PCIe bandwidth in the
>> >server. Meaning NIC needs to do buffering much like a switch would.
>> >So we need to separate the uplink from the PF to attach the Qdisc
>> >offload for configuring details of PCIe queuing.  
>> 
>> Hmm, I'm not sure I understand. Then the PF and uplink are the same entity,
>> you can still attach the qdisc to this entity, right? What prevents the
>> same functionality as with the "split"?
>
>The same problem we have with the VF TX rate. We only have Qdisc APIs
>for the TX direction. If we only have one port its TX is the TX onto
>wire. If we split it into MAC/phys and PCIe - the TX of PCIe is the RX
>of the host, allowing us to control queuing on the PCIe interface.

I see. But is it needed? I mean, this is for the "management pf" on the
local host. You don't put it into VM. You should only use it for
slowpath (like ARPs, OVS, etc.). If you want to have traffic from
localhost that you need to rate limit, you can create a dynamic PF, VF,
or SF for it.


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:21         ` Jakub Kicinski
  2020-03-23 22:06           ` Jason Gunthorpe
  2020-03-26 14:37           ` Jiri Pirko
@ 2020-03-26 14:43           ` Jiri Pirko
  2020-03-26 14:47           ` Jiri Pirko
  2020-03-26 14:59           ` Jiri Pirko
  4 siblings, 0 replies; 50+ messages in thread
From: Jiri Pirko @ 2020-03-26 14:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

>> >> Now the PF itself can have a "nested eswitch" to manage. The "parent
>> >> eswitch" where the PF was created would only see one leg to the "nested
>> >> eswitch".
>> >> 
>> >> This "nested eswitch management" might or might not be required. Depends
> >> on a usecase. The question was, how to configure that I as a user
>> >> want this or not.  
>> >
>> >Ack. I'm extending your question. I think the question is not only who
>> >controls the eswitch but also which PFs share the eswitch.  
>> 
>> Yes.
>> 
>> >
>> >I think eswitch is just one capability, but SmartNIC will want to
>> >control which ports see what capabilities in general. crypto offloads
>> >and such.
>> >
>> >I presume in your model if host controls eswitch the smartNIC sees just  
>> 
>> host may control the "nested eswitch" in the SmartNIC case.
>
>By nested eswitch you mean eswitch between ports of the same Host, or
>within one PF? Then SmartNIC may switch between PFs or multiple hosts?

In general, each pf can manage a switch and have another pf underneath
which may manage the nested switch. This may go to more than 2 levels in
theory.

It is basically an independent switch with uplink to the higher switch.

It can be on the same host, or a different host. Does not matter.


>
>> >what comes out of the Host's single "uplink"? What if SmartNIC wants
>> >the host to be able to control the forwarding but not lose the ability
>> >to tap the VF to VF traffic?  
>> 
>> You mean that the VF representors would be in both SmartNIC host and
>> host? I don't know how that could work. I think it has to be either
>> there or there.
>
>That'd certainly be easier. Without representors we can't even check
>traffic stats, though. SmartNIC may want to see the resource
>utilization of ports even if it doesn't see the ports. That's just a
>theoretical divagation, I don't think it's required.

Hmm, it would be really off. I don't think it would be possible to use
them both to do "write" config. Maybe only "read" for showing some stats
or something. Each would have a netdev representor? One may be used to
send data and the second not? That would be hell to maintain and
understand :/


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:21         ` Jakub Kicinski
                             ` (2 preceding siblings ...)
  2020-03-26 14:43           ` Jiri Pirko
@ 2020-03-26 14:47           ` Jiri Pirko
  2020-03-26 14:51             ` Jiri Pirko
  2020-03-26 14:59           ` Jiri Pirko
  4 siblings, 1 reply; 50+ messages in thread
From: Jiri Pirko @ 2020-03-26 14:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

>> >> >> $ devlink slice show
>> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>> >> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2    
>> >> >
>> >> >What are slices?    
>> >> 
>> >> Slice is basically a piece of ASIC. pf/vf/sf. They serve for
>> >> configuration of the "other side of the wire". Like the mac. Hypervizor
>> >> admin can use the slite to set the mac address of a VF which is in the
>> >> virtual machine. Basically this should be a replacement of "ip vf"
>> >> command.  
>> >
>> >I lost my mail archive but didn't we already have a long thread with
>> >Parav about this?  
>> 
>> I believe so.
>
>Oh, well. I still don't see the need for it :( If it's one to one with
>ports why add another API, and have to do some cross linking to get
>from one to the other?
>
>I'd much rather resources hanging off the port.

Yeah, I was originally saying exactly the same as you do. However, there
might be slices that are not related to any port. Like NVE. Port does
not make sense in that world. It is just a slice of device.
Do we want to model those as "ports" too? Maybe. What do you think?


* Re: [RFC] current devlink extension plan for NICs
  2020-03-26 14:47           ` Jiri Pirko
@ 2020-03-26 14:51             ` Jiri Pirko
  2020-03-26 20:30               ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Jiri Pirko @ 2020-03-26 14:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

Thu, Mar 26, 2020 at 03:47:09PM CET, jiri@resnulli.us wrote:
>>> >> >> $ devlink slice show
>>> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>>> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>>> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>>> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>>> >> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2    
>>> >> >
>>> >> >What are slices?    
>>> >> 
>>> >> Slice is basically a piece of ASIC. pf/vf/sf. They serve for
>>> >> configuration of the "other side of the wire". Like the mac. Hypervisor
>>> >> admin can use the slice to set the mac address of a VF which is in the
>>> >> virtual machine. Basically this should be a replacement of "ip vf"
>>> >> command.  
>>> >
>>> >I lost my mail archive but didn't we already have a long thread with
>>> >Parav about this?  
>>> 
>>> I believe so.
>>
>>Oh, well. I still don't see the need for it :( If it's one to one with
>>ports why add another API, and have to do some cross linking to get
>>from one to the other?
>>
>>I'd much rather resources hanging off the port.
>
>Yeah, I was originally saying exactly the same as you do. However, there
>might be slices that are not related to any port. Like NVE. Port does
>not make sense in that world. It is just a slice of device.
>Do we want to model those as "ports" too? Maybe. What do you think?

Also, the slice is to model "the other side of the wire":

eswitch - devlink_port ...... slice

If we have it under devlink port, it would probably
have to be nested object to have the clean cut.


* Re: [RFC] current devlink extension plan for NICs
  2020-03-23 19:21         ` Jakub Kicinski
                             ` (3 preceding siblings ...)
  2020-03-26 14:47           ` Jiri Pirko
@ 2020-03-26 14:59           ` Jiri Pirko
  4 siblings, 0 replies; 50+ messages in thread
From: Jiri Pirko @ 2020-03-26 14:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

>> >> >> $ devlink slice add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc    
>> >> >
>> >> >Why is the SF number specified by the user rather than allocated?    
>> >> 
>> >> Because it is shown in the representor netdevice name. And you need to have
>> >> it predetermined: enp6s0pf1sf10  
>> >
>> >I'd think you need to know what was assigned, not necessarily pick
>> >upfront.. I feel like we had this conversation before as well.  
>> 
>> Yeah. For the scripting sake, always when you create something, you can
>> directly use it later in the script. Like if you create a bridge, you
>> assign it a name so you can use it.
>> 
>> The "what was assigned" would mean that the assigned
>> value has to be somehow returned from the kernel and passed to the
>> script. Not sure how. Do you have any example where this is happening in
>> networking?
>
>Not really, but when allocating objects it seems idiomatic to get the
>id / handle / address of the new entity in response. Seems to me we're
>not doing it because the infrastructure for it is not in place, but
>it'd be a good extension.
>
>Times are a little crazy but I can take a poke at implementing
>something along those lines once I find some time..

I can't really see how this is supposed to work efficiently. Imagine a
simple dummy script:
devlink slice add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10
devlink slice set pci/0000:06:00.0/100 hw_addr aa:bb:cc:aa:bb:cc
devlink slice del pci/0000:06:00.0/100

The handle is clear then, used for add/set/del. The same thing.


Now with dynamically allocated index that you suggest:
devlink slice add pci/0000:06:00.0 flavour pcisf pfnum 1 sfnum 10
# somehow get the allocated 100 into variable $XXX
XXX=???
devlink slice set pci/0000:06:00.0/$XXX hw_addr aa:bb:cc:aa:bb:cc
devlink slice del pci/0000:06:00.0/$XXX

there are two things I don't like about this:
1) You use different handles for different actions.
2) You need to somehow get the number into variable $XXX

What is the benefit of this approach?
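
One hypothetical way to make the allocated-index variant scriptable, assuming
the kernel echoed the new handle back and devlink printed it via its existing
-j JSON output (neither the slice command nor the echo exists today; the mock
function below stands in for that reply):

```shell
# Mock of the hypothetical 'devlink -j slice add' reply; illustrative only.
mock_devlink_slice_add() {
    echo '{"slice":{"pci/0000:06:00.0/100":{"flavour":"pcisf","pfnum":1,"sfnum":10}}}'
}

# Capture the kernel-allocated handle with POSIX sed (no jq dependency).
# The pattern keys on the first quoted string starting with "pci/".
HANDLE=$(mock_devlink_slice_add | sed -n 's|.*"\(pci/[^"]*\)".*|\1|p')

echo "$HANDLE"    # prints: pci/0000:06:00.0/100

# Subsequent commands could then reuse the returned handle:
# devlink slice set "$HANDLE" hw_addr aa:bb:cc:aa:bb:cc
# devlink slice del "$HANDLE"
```

This still leaves point (1) above standing: the handle used for set/del would
differ from the arguments given to add.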


* Re: [RFC] current devlink extension plan for NICs
  2020-03-26 14:51             ` Jiri Pirko
@ 2020-03-26 20:30               ` Jakub Kicinski
  2020-03-27  7:47                 ` Jiri Pirko
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-26 20:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Thu, 26 Mar 2020 15:51:46 +0100 Jiri Pirko wrote:
> Thu, Mar 26, 2020 at 03:47:09PM CET, jiri@resnulli.us wrote:
> >>> >> >> $ devlink slice show
> >>> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> >>> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
> >>> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
> >>> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
> >>> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
> >>> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
> >>> >> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2      
> >>> >> >
> >>> >> >What are slices?      
> >>> >> 
> >>> >> Slice is basically a piece of ASIC. pf/vf/sf. They serve for
> >>> >> configuration of the "other side of the wire". Like the mac. Hypervisor
> >>> >> admin can use the slice to set the mac address of a VF which is in the
> >>> >> virtual machine. Basically this should be a replacement of "ip vf"
> >>> >> command.    
> >>> >
> >>> >I lost my mail archive but didn't we already have a long thread with
> >>> >Parav about this?    
> >>> 
> >>> I believe so.  
> >>
> >>Oh, well. I still don't see the need for it :( If it's one to one with
> >>ports why add another API, and have to do some cross linking to get
> >>from one to the other?
> >>
> >>I'd much rather resources hanging off the port.  
> >
> >Yeah, I was originally saying exactly the same as you do. However, there
> >might be slices that are not related to any port. Like NVE. Port does
> >not make sense in that world. It is just a slice of device.
> >Do we want to model those as "ports" too? Maybe. What do you think?  
> 
> Also, the slice is to model "the other side of the wire":
> 
> eswitch - devlink_port ...... slice
> 
> If we have it under devlink port, it would probably
> have to be nested object to have the clean cut.

So the queues, interrupts, and other resources are also part 
of the slice then?

How do slice parameters like rate apply to NVMe?

Are ports always ethernet? and slices also cover endpoints with
transport stack offloaded to the NIC?


* Re: [RFC] current devlink extension plan for NICs
  2020-03-26 20:30               ` Jakub Kicinski
@ 2020-03-27  7:47                 ` Jiri Pirko
  2020-03-27 16:38                   ` Jakub Kicinski
  2020-03-30  5:30                   ` Parav Pandit
  0 siblings, 2 replies; 50+ messages in thread
From: Jiri Pirko @ 2020-03-27  7:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

Thu, Mar 26, 2020 at 09:30:01PM CET, kuba@kernel.org wrote:
>On Thu, 26 Mar 2020 15:51:46 +0100 Jiri Pirko wrote:
>> Thu, Mar 26, 2020 at 03:47:09PM CET, jiri@resnulli.us wrote:
>> >>> >> >> $ devlink slice show
>> >>> >> >> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>> >>> >> >> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>> >>> >> >> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>> >>> >> >> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>> >>> >> >> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>> >>> >> >> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>> >>> >> >> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2      
>> >>> >> >
>> >>> >> >What are slices?      
>> >>> >> 
>> >>> >> Slice is basically a piece of ASIC. pf/vf/sf. They serve for
>> >>> >> configuration of the "other side of the wire". Like the mac. Hypervisor
>> >>> >> admin can use the slice to set the mac address of a VF which is in the
>> >>> >> virtual machine. Basically this should be a replacement of "ip vf"
>> >>> >> command.    
>> >>> >
>> >>> >I lost my mail archive but didn't we already have a long thread with
>> >>> >Parav about this?    
>> >>> 
>> >>> I believe so.  
>> >>
>> >>Oh, well. I still don't see the need for it :( If it's one to one with
>> >>ports why add another API, and have to do some cross linking to get
>> >>from one to the other?
>> >>
>> >>I'd much rather resources hanging off the port.  
>> >
>> >Yeah, I was originally saying exactly the same as you do. However, there
>> >might be slices that are not related to any port. Like NVE. Port does
>> >not make sense in that world. It is just a slice of device.
>> >Do we want to model those as "ports" too? Maybe. What do you think?  
>> 
>> Also, the slice is to model "the other side of the wire":
>> 
>> eswitch - devlink_port ...... slice
>> 
>> If we have it under devlink port, it would probably
>> have to be nested object to have the clean cut.
>
>So the queues, interrupts, and other resources are also part 
>of the slice then?

Yep, that seems to make sense.

>
>How do slice parameters like rate apply to NVMe?

Not really.

>
>Are ports always ethernet? and slices also cover endpoints with
>transport stack offloaded to the NIC?

devlink_port now can be either "ethernet" or "infiniband". Perhaps,
there can be port type "nve" which would contain only some of the
config options and would not have a representor "netdev/ibdev" linked.
I don't know.


* Re: [RFC] current devlink extension plan for NICs
  2020-03-27  7:47                 ` Jiri Pirko
@ 2020-03-27 16:38                   ` Jakub Kicinski
  2020-03-27 18:49                     ` Samudrala, Sridhar
  2020-03-30  7:48                     ` Parav Pandit
  2020-03-30  5:30                   ` Parav Pandit
  1 sibling, 2 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-27 16:38 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:
> >So the queues, interrupts, and other resources are also part 
> >of the slice then?  
> 
> Yep, that seems to make sense.
> 
> >How do slice parameters like rate apply to NVMe?  
> 
> Not really.
> 
> >Are ports always ethernet? and slices also cover endpoints with
> >transport stack offloaded to the NIC?  
> 
> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
> there can be port type "nve" which would contain only some of the
> config options and would not have a representor "netdev/ibdev" linked.
> I don't know.

I honestly find it hard to understand what that slice abstraction is,
and which things belong to slices and which to PCI ports (or why we even
have them).

With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
would have made sense to have a slice that covers multiple ports, but
it seems the proposal is to have port to slice mapping be 1:1. And rate
in those devices should still be per port not per slice.

But this keeps coming back, and since you guys are doing all the work,
if you really really need it..


* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 16:38                   ` Jakub Kicinski
@ 2020-03-27 18:49                     ` Samudrala, Sridhar
  2020-03-27 19:10                       ` Jakub Kicinski
  2020-03-30  7:48                     ` Parav Pandit
  1 sibling, 1 reply; 50+ messages in thread
From: Samudrala, Sridhar @ 2020-03-27 18:49 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson



On 3/27/2020 9:38 AM, Jakub Kicinski wrote:
> On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:
>>> So the queues, interrupts, and other resources are also part
>>> of the slice then?
>>
>> Yep, that seems to make sense.
>>
>>> How do slice parameters like rate apply to NVMe?
>>
>> Not really.
>>
>>> Are ports always ethernet? and slices also cover endpoints with
>>> transport stack offloaded to the NIC?
>>
>> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
>> there can be port type "nve" which would contain only some of the
>> config options and would not have a representor "netdev/ibdev" linked.
>> I don't know.
> 
> I honestly find it hard to understand what that slice abstraction is,
> and which things belong to slices and which to PCI ports (or why we even
> have them).

Looks like slice is a new term for sub-function, and we can consider this
as a VMDq VSI (Intel terminology) or even a queue group of a VSI.

Today we expose a VMDq VSI via an offloaded MACVLAN. This mechanism should
allow us to expose it as a separate netdev.
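
For reference, the offloaded-MACVLAN path mentioned above looks roughly like
this today (a hedged sketch: it assumes a NIC/driver exposing the
`l2-fwd-offload` feature, and the interface names are illustrative):

```shell
# Enable MACVLAN hardware offload (NETIF_F_HW_L2FW_DOFFLOAD) on the uplink:
ethtool -K eth0 l2-fwd-offload on

# This MACVLAN is then backed by a dedicated VMDq queue set in hardware:
ip link add link eth0 name macvlan0 type macvlan mode bridge
ip link set macvlan0 up
```

A slice/sub-function API would expose the same queue set as a first-class
netdev instead of piggybacking on MACVLAN.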

> 
> With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
> would have made sense to have a slice that covers multiple ports, but
> it seems the proposal is to have port to slice mapping be 1:1. And rate
> in those devices should still be per port not per slice.
> 
> But this keeps coming back, and since you guys are doing all the work,
> if you really really need it..



* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 18:49                     ` Samudrala, Sridhar
@ 2020-03-27 19:10                       ` Jakub Kicinski
  2020-03-27 19:45                         ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-27 19:10 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jiri Pirko, netdev, davem, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, magnus.karlsson

On Fri, 27 Mar 2020 11:49:10 -0700 Samudrala, Sridhar wrote:
> On 3/27/2020 9:38 AM, Jakub Kicinski wrote:
> > On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:  
> >>> So the queues, interrupts, and other resources are also part
> >>> of the slice then?  
> >>
> >> Yep, that seems to make sense.
> >>  
> >>> How do slice parameters like rate apply to NVMe?  
> >>
> >> Not really.
> >>  
> >>> Are ports always ethernet? and slices also cover endpoints with
> >>> transport stack offloaded to the NIC?  
> >>
> >> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
> >> there can be port type "nve" which would contain only some of the
> >> config options and would not have a representor "netdev/ibdev" linked.
> >> I don't know.  
> > 
> > I honestly find it hard to understand what that slice abstraction is,
> > and which things belong to slices and which to PCI ports (or why we even
> > have them).  
> 
> Looks like slice is a new term for sub function and we can consider this
> as a VMDQ VSI(intel terminology) or even a Queue group of a VSI.
> 
> Today we expose VMDQ VSI via offloaded MACVLAN. This mechanism should 
> allow us to expose it as a separate netdev.

Kinda. Looks like with the new APIs you guys will definitely be able to
expose VMDQ as a full(er) device, and if memory serves me well that's
what you wanted initially.

But the sub-functions are just a subset of slices; PFs and VFs also
have a slice associated with them.. And all those things have a port,
too.

> > With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
> > would have made sense to have a slice that covers multiple ports, but
> > it seems the proposal is to have port to slice mapping be 1:1. And rate
> > in those devices should still be per port not per slice.
> > 
> > But this keeps coming back, and since you guys are doing all the work,
> > if you really really need it..  


* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 19:10                       ` Jakub Kicinski
@ 2020-03-27 19:45                         ` Saeed Mahameed
  2020-03-27 20:42                           ` Jakub Kicinski
                                             ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Saeed Mahameed @ 2020-03-27 19:45 UTC (permalink / raw)
  To: sridhar.samudrala, kuba
  Cc: Aya Levin, andrew.gospodarek, sburla, jiri, Tariq Toukan, davem,
	netdev, Vlad Buslov, lihong.yang, Ido Schimmel, jgg, fmanlunas,
	oss-drivers, leon, Parav Pandit, grygorii.strashko, michael.chan,
	Alex Vesker, snelson, linyunsheng, magnus.karlsson, dchickles,
	jacob.e.keller, Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery,
	drivers, mlxsw, GR-everest-linux-l2, Yevgeny Kliteynik,
	vikas.gupta, Eran Ben Elisha

On Fri, 2020-03-27 at 12:10 -0700, Jakub Kicinski wrote:
> On Fri, 27 Mar 2020 11:49:10 -0700 Samudrala, Sridhar wrote:
> > On 3/27/2020 9:38 AM, Jakub Kicinski wrote:
> > > On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:  
> > > > > So the queues, interrupts, and other resources are also part
> > > > > of the slice then?  
> > > > 
> > > > Yep, that seems to make sense.
> > > >  
> > > > > How do slice parameters like rate apply to NVMe?  
> > > > 
> > > > Not really.
> > > >  
> > > > > Are ports always ethernet? and slices also cover endpoints
> > > > > with
> > > > > transport stack offloaded to the NIC?  
> > > > 
> > > > devlink_port now can be either "ethernet" or "infiniband".
> > > > Perhaps,
> > > > there can be port type "nve" which would contain only some of
> > > > the
> > > > config options and would not have a representor "netdev/ibdev"
> > > > linked.
> > > > I don't know.  
> > > 
> > > I honestly find it hard to understand what that slice abstraction
> > > is,
> > > and which things belong to slices and which to PCI ports (or why
> > > we even
> > > have them).  
> > 
> > Looks like slice is a new term for sub function and we can consider
> > this
> > as a VMDQ VSI(intel terminology) or even a Queue group of a VSI.
> > 
> > Today we expose VMDQ VSI via offloaded MACVLAN. This mechanism
> > should 
> > allow us to expose it as a separate netdev.
> 
> Kinda. Looks like with the new APIs you guys will definitely be able
> to
> expose VMDQ as a full(er) device, and if memory serves me well that's
> what you wanted initially.
> 

VMDq is just a steering-based isolated set of rx/tx rings pointed at by
a dumb steering rule in the HW .. I am not sure we can just wrap them
in their own vendor-specific netdev and just call it a slice..

from what i understand, a real slice is a full isolated HW pipeline
with its own HW resources and HW based isolation, a slice rings/hw
resources can never be shared between different slices, just like a vf,
but without the pcie virtual function back-end..

Why would you need a devlink slice instance for something that has only
rx/tx ring attributes? If we are going with such a design, then I guess
a simple rdma app with a pair of QPs can call itself a slice ..

We need a clear-cut definition of what a sub-function slice is.. This
RFC doesn't seem to address that clearly.

> But the sub-functions are just a subset of slices, PF and VFs also
> have a slice associated with them.. And all those things have a port,
> too.
> 

PFs/VFs might have more than one port sometimes .. 

> > > With devices like NFP and Mellanox CX3 which have one PCI PF
> > > maybe it
> > > would have made sense to have a slice that covers multiple ports,
> > > but
> > > it seems the proposal is to have port to slice mapping be 1:1.
> > > And rate
> > > in those devices should still be per port not per slice.
> > > 
> > > But this keeps coming back, and since you guys are doing all the
> > > work,
> > > if you really really need it..  


* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 19:45                         ` Saeed Mahameed
@ 2020-03-27 20:42                           ` Jakub Kicinski
  2020-03-30  9:07                             ` Parav Pandit
  2020-03-27 20:47                           ` Samudrala, Sridhar
  2020-03-30  7:09                           ` Parav Pandit
  2 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-27 20:42 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: sridhar.samudrala, Aya Levin, andrew.gospodarek, sburla, jiri,
	Tariq Toukan, davem, netdev, Vlad Buslov, lihong.yang,
	Ido Schimmel, jgg, fmanlunas, oss-drivers, leon, Parav Pandit,
	grygorii.strashko, michael.chan, Alex Vesker, snelson,
	linyunsheng, magnus.karlsson, dchickles, jacob.e.keller,
	Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery, drivers, mlxsw,
	GR-everest-linux-l2, Yevgeny Kliteynik, vikas.gupta,
	Eran Ben Elisha

On Fri, 27 Mar 2020 19:45:53 +0000 Saeed Mahameed wrote:
> On Fri, 2020-03-27 at 12:10 -0700, Jakub Kicinski wrote:
> > On Fri, 27 Mar 2020 11:49:10 -0700 Samudrala, Sridhar wrote:  
> > > On 3/27/2020 9:38 AM, Jakub Kicinski wrote:  
> > > > On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:    
> > > > > > So the queues, interrupts, and other resources are also part
> > > > > > of the slice then?    
> > > > > 
> > > > > Yep, that seems to make sense.
> > > > >    
> > > > > > How do slice parameters like rate apply to NVMe?    
> > > > > 
> > > > > Not really.
> > > > >    
> > > > > > Are ports always ethernet? and slices also cover endpoints
> > > > > > with
> > > > > > transport stack offloaded to the NIC?    
> > > > > 
> > > > > devlink_port now can be either "ethernet" or "infiniband".
> > > > > Perhaps,
> > > > > there can be port type "nve" which would contain only some of
> > > > > the
> > > > > config options and would not have a representor "netdev/ibdev"
> > > > > linked.
> > > > > I don't know.    
> > > > 
> > > > I honestly find it hard to understand what that slice abstraction
> > > > is,
> > > > and which things belong to slices and which to PCI ports (or why
> > > > we even
> > > > have them).    
> > > 
> > > Looks like slice is a new term for sub function and we can consider
> > > this
> > > as a VMDQ VSI(intel terminology) or even a Queue group of a VSI.
> > > 
> > > Today we expose VMDQ VSI via offloaded MACVLAN. This mechanism
> > > should 
> > > allow us to expose it as a separate netdev.  
> > 
> > Kinda. Looks like with the new APIs you guys will definitely be able
> > to
> > expose VMDQ as a full(er) device, and if memory serves me well that's
> > what you wanted initially.
> 
> VMDQ is just a steering based isolated set of rx tx rings pointed at by
> a dumb steering rule in the HW .. i am not sure we can just wrap them
> in their own vendor specific netdev and just call it a slice..
> 
> from what i understand, a real slice is a full isolated HW pipeline
> with its own HW resources and HW based isolation, a slice rings/hw
> resources can never be shared between different slices, just like a vf,
> but without the pcie virtual function back-end..
> 
> Why would you need a devlink slice instance for something that has only
> rx/tx rings attributes ? if we are going with such design, then i guess
> a simple rdma app with a pair of QPs can call itself a slice .. 

Ack, I'm not sure where Intel is, but I'd hope since VMDq in its
"just a bunch of queues with dumb steering" form was created in
igb/ixgbe days, 2 generations of HW later it's not just that..

> We need a clear-cut definition of what a Sub-function slice is.. this
> RFC doesn't seem to address that clearly.

Definitely. I'd say we need a clear definition of (a) what a
sub-function is, and (b) what a slice is.

> > But the sub-functions are just a subset of slices, PF and VFs also
> > have a slice associated with them.. And all those things have a port,
> > too.
> >   
> 
> PFs/VFs, might have more than one port sometimes .. 

Like I said below, right? So in that case do you think they should have
multiple slices, too, or a slice per port?

> > > > With devices like NFP and Mellanox CX3 which have one PCI PF
> > > > maybe it
> > > > would have made sense to have a slice that covers multiple ports,
> > > > but
> > > > it seems the proposal is to have port to slice mapping be 1:1.
> > > > And rate
> > > > in those devices should still be per port not per slice.
> > > > 
> > > > But this keeps coming back, and since you guys are doing all the
> > > > work,
> > > > if you really really need it..    


* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 19:45                         ` Saeed Mahameed
  2020-03-27 20:42                           ` Jakub Kicinski
@ 2020-03-27 20:47                           ` Samudrala, Sridhar
  2020-03-27 20:59                             ` Jakub Kicinski
  2020-03-30  7:09                           ` Parav Pandit
  2 siblings, 1 reply; 50+ messages in thread
From: Samudrala, Sridhar @ 2020-03-27 20:47 UTC (permalink / raw)
  To: Saeed Mahameed, kuba
  Cc: Aya Levin, andrew.gospodarek, sburla, jiri, Tariq Toukan, davem,
	netdev, Vlad Buslov, lihong.yang, Ido Schimmel, jgg, fmanlunas,
	oss-drivers, leon, Parav Pandit, grygorii.strashko, michael.chan,
	Alex Vesker, snelson, linyunsheng, magnus.karlsson, dchickles,
	jacob.e.keller, Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery,
	drivers, mlxsw, GR-everest-linux-l2, Yevgeny Kliteynik,
	vikas.gupta, Eran Ben Elisha



On 3/27/2020 12:45 PM, Saeed Mahameed wrote:
> On Fri, 2020-03-27 at 12:10 -0700, Jakub Kicinski wrote:
>> On Fri, 27 Mar 2020 11:49:10 -0700 Samudrala, Sridhar wrote:
>>> On 3/27/2020 9:38 AM, Jakub Kicinski wrote:
>>>> On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:
>>>>>> So the queues, interrupts, and other resources are also part
>>>>>> of the slice then?
>>>>>
>>>>> Yep, that seems to make sense.
>>>>>   
>>>>>> How do slice parameters like rate apply to NVMe?
>>>>>
>>>>> Not really.
>>>>>   
>>>>>> Are ports always ethernet? and slices also cover endpoints
>>>>>> with
>>>>>> transport stack offloaded to the NIC?
>>>>>
>>>>> devlink_port now can be either "ethernet" or "infiniband".
>>>>> Perhaps,
>>>>> there can be port type "nve" which would contain only some of
>>>>> the
>>>>> config options and would not have a representor "netdev/ibdev"
>>>>> linked.
>>>>> I don't know.
>>>>
>>>> I honestly find it hard to understand what that slice abstraction
>>>> is,
>>>> and which things belong to slices and which to PCI ports (or why
>>>> we even
>>>> have them).
>>>
>>> Looks like slice is a new term for sub function and we can consider
>>> this
>>> as a VMDQ VSI(intel terminology) or even a Queue group of a VSI.
>>>
>>> Today we expose VMDQ VSI via offloaded MACVLAN. This mechanism
>>> should
>>> allow us to expose it as a separate netdev.
>>
>> Kinda. Looks like with the new APIs you guys will definitely be able
>> to
>> expose VMDQ as a full(er) device, and if memory serves me well that's
>> what you wanted initially.
>>
> 
> VMDQ is just a steering based isolated set of rx tx rings pointed at by
> a dumb steering rule in the HW .. i am not sure we can just wrap them
> in their own vendor specific netdev and just call it a slice..
> 
> from what i understand, a real slice is a full isolated HW pipeline
> with its own HW resources and HW based isolation, a slice rings/hw
> resources can never be shared between different slices, just like a vf,
> but without the pcie virtual function back-end..

The above definition of slice matches a VMDq VSI. It is an isolated set
of queues/interrupt vectors/RSS, and packet steering rules can be added
to steer packets from and to this entity. A PF is associated with its
own VSI and can spawn multiple VMDq VSIs similar to VFs.

By default a PF is associated with a single uplink port.

> 
> Why would you need a devlink slice instance for something that has only
> rx/tx rings attributes ? if we are going with such design, then i guess
> a simple rdma app with a pair of QPs can call itself a slice ..
> 
> We need a clear-cut definition of what a Sub-function slice is.. this
> RFC doesn't seem to address that clearly.
> 
>> But the sub-functions are just a subset of slices, PF and VFs also
>> have a slice associated with them.. And all those things have a port,
>> too.
>>
> 
> PFs/VFs, might have more than one port sometimes ..

Yes. When the uplink ports are in a LAG, then a PF/VF/slice should be
able to send to or receive from more than one port.

> 
>>>> With devices like NFP and Mellanox CX3 which have one PCI PF
>>>> maybe it
>>>> would have made sense to have a slice that covers multiple ports,
>>>> but
>>>> it seems the proposal is to have port to slice mapping be 1:1.
>>>> And rate
>>>> in those devices should still be per port not per slice.
>>>>
>>>> But this keeps coming back, and since you guys are doing all the
>>>> work,
>>>> if you really really need it..


* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 20:47                           ` Samudrala, Sridhar
@ 2020-03-27 20:59                             ` Jakub Kicinski
  0 siblings, 0 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-27 20:59 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Saeed Mahameed, Aya Levin, andrew.gospodarek, sburla, jiri,
	Tariq Toukan, davem, netdev, Vlad Buslov, lihong.yang,
	Ido Schimmel, jgg, fmanlunas, oss-drivers, leon, Parav Pandit,
	grygorii.strashko, michael.chan, Alex Vesker, snelson,
	linyunsheng, magnus.karlsson, dchickles, jacob.e.keller,
	Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery, drivers, mlxsw,
	GR-everest-linux-l2, Yevgeny Kliteynik, vikas.gupta,
	Eran Ben Elisha

On Fri, 27 Mar 2020 13:47:44 -0700 Samudrala, Sridhar wrote:
> >> But the sub-functions are just a subset of slices, PF and VFs also
> >> have a slice associated with them.. And all those things have a port,
> >> too.
> >>  
> > 
> > PFs/VFs, might have more than one port sometimes ..  
> 
> Yes. When the uplink ports are in a LAG, then a PF/VF/slice should be 
> able to send or receive from more than 1 port.

So that's a little simpler than what I was considering, in mlx4 and older
nfps we have 1 PCI PF for multi-port devices. There is only one PF with
multiple BAR regions corresponding to different device ports.

So you can still address fully independently the pipelines for two
ports, but they are "mapped" in the same PCI PF.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-27  7:47                 ` Jiri Pirko
  2020-03-27 16:38                   ` Jakub Kicinski
@ 2020-03-30  5:30                   ` Parav Pandit
  1 sibling, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-03-30  5:30 UTC (permalink / raw)
  To: Jiri Pirko, Jakub Kicinski
  Cc: netdev, davem, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On 3/27/2020 1:17 PM, Jiri Pirko wrote:
> Thu, Mar 26, 2020 at 09:30:01PM CET, kuba@kernel.org wrote:
>> On Thu, 26 Mar 2020 15:51:46 +0100 Jiri Pirko wrote:
>>> Thu, Mar 26, 2020 at 03:47:09PM CET, jiri@resnulli.us wrote:
>>>>>>>>>> $ devlink slice show
>>>>>>>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>>>>>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>>>>>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>>>>>>>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>>>>>>>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>>>>>>>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>>>>>>>>>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2      
>>>>>>>>>
>>>>>>>>> What are slices?      
>>>>>>>>
>>>>>>>> Slice is basically a piece of ASIC: pf/vf/sf. They serve for
>>>>>>>> configuration of the "other side of the wire", like the mac. A hypervisor
>>>>>>>> admin can use the slice to set the mac address of a VF which is in the
>>>>>>>> virtual machine. Basically this should be a replacement of the "ip vf"
>>>>>>>> command.    
>>>>>>>
>>>>>>> I lost my mail archive but didn't we already have a long thread with
>>>>>>> Parav about this?    
>>>>>>
>>>>>> I believe so.  
>>>>>
>>>>> Oh, well. I still don't see the need for it :( If it's one to one with
>>>>> ports why add another API, and have to do some cross linking to get
>>>>> from one to the other?
>>>>>
>>>>> I'd much rather resources hanging off the port.  
>>>>
>>>> Yeah, I was originally saying exactly the same as you do. However, there
>>>> might be slices that are not related to any port. Like NVE. Port does
>>>> not make sense in that world. It is just a slice of device.
>>>> Do we want to model those as "ports" too? Maybe. What do you think?  
>>>
>>> Also, the slice is to model "the other side of the wire":
>>>
>>> eswitch - devlink_port ...... slice
>>>
>>> If we have it under devlink port, it would probably
>>> have to be nested object to have the clean cut.
>>
>> So the queues, interrupts, and other resources are also part 
>> of the slice then?
> 
> Yep, that seems to make sense.
> 
>>
>> How do slice parameters like rate apply to NVMe?
> 
> Not really.
> 
>>
>> Are ports always ethernet? and slices also cover endpoints with
>> transport stack offloaded to the NIC?
> 
> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
> there can be port type "nve" which would contain only some of the
> config options and would not have a representor "netdev/ibdev" linked.
> I don't know.
> 
devlink slice represents a PF/VF/SF. This means that a given function
can have an rdma, eth and more class of device.
So port of a slice is both eth+rdma (not either or).


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 19:45                         ` Saeed Mahameed
  2020-03-27 20:42                           ` Jakub Kicinski
  2020-03-27 20:47                           ` Samudrala, Sridhar
@ 2020-03-30  7:09                           ` Parav Pandit
  2 siblings, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-03-30  7:09 UTC (permalink / raw)
  To: Saeed Mahameed, sridhar.samudrala, kuba
  Cc: Aya Levin, andrew.gospodarek, sburla, jiri, Tariq Toukan, davem,
	netdev, Vlad Buslov, lihong.yang, Ido Schimmel, jgg, fmanlunas,
	oss-drivers, leon, grygorii.strashko, michael.chan, Alex Vesker,
	snelson, linyunsheng, magnus.karlsson, dchickles, jacob.e.keller,
	Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery, drivers, mlxsw,
	GR-everest-linux-l2, Yevgeny Kliteynik, vikas.gupta,
	Eran Ben Elisha

On 3/28/2020 1:15 AM, Saeed Mahameed wrote:
> 
> We need a clear-cut definition of what a Sub-function slice is.. this
> RFC doesn't seem to address that clearly.
> 
Jiri's RFC content was already big, so we skipped a lot of the SF slice
plumbing details.
I will shortly post extended content in this email thread which talks
about SF, slice and all of their plumbing details.

>> But the sub-functions are just a subset of slices, PF and VFs also
>> have a slice associated with them.. And all those things have a port,
>> too.
Yes.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 16:38                   ` Jakub Kicinski
  2020-03-27 18:49                     ` Samudrala, Sridhar
@ 2020-03-30  7:48                     ` Parav Pandit
  2020-03-30 19:36                       ` Jakub Kicinski
  1 sibling, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-03-30  7:48 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, davem, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

Hi Jakub,

On 3/27/2020 10:08 PM, Jakub Kicinski wrote:
> On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:
>>> So the queues, interrupts, and other resources are also part 
>>> of the slice then?  
>>
>> Yep, that seems to make sense.
>>
>>> How do slice parameters like rate apply to NVMe?  
>>
>> Not really.
>>
>>> Are ports always ethernet? and slices also cover endpoints with
>>> transport stack offloaded to the NIC?  
>>
>> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
>> there can be port type "nve" which would contain only some of the
>> config options and would not have a representor "netdev/ibdev" linked.
>> I don't know.
> 
> I honestly find it hard to understand what that slice abstraction is,
> and which things belong to slices and which to PCI ports (or why we even
> have them).
> 
As an alternative, devlink port can be overloaded/retrofitted to do all
the things that slice is meant to do.
For that matter, the representor netdev can be overloaded/extended to do
what slice is meant to do (instead of devlink port).

Can you please explain why you think devlink port should be overloaded
instead of netdev or any other kernel object?
Do you have an example of such overloaded functionality of a kernel object?
Like, why are the macvlan and vlan drivers not combined into a single
driver object? Why are the teaming and bonding drivers not combined into
a single driver object?...

A user should be able to create, configure, deploy and delete a 'portion
of the device' with or without an eswitch.
We shouldn't be starting with a restrictive/narrow view of devlink port.

Internally, with Jiri and others, we also explored the possibility of
having 'mgmtvf', 'mgmtpf' and 'mgmtsf' port flavours by overloading the
port to do all the things a slice does.
It wasn't elegant enough. Why not create the right object?

Additionally, the devlink port object doesn't go through the same state
machine that a slice has to go through.
So it's weird that some devlink ports have a state machine and some don't.

> With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
> would have made sense to have a slice that covers multiple ports, but
> it seems the proposal is to have port to slice mapping be 1:1. And rate
> in those devices should still be per port not per slice.
> 
A slice can have multiple ports; the slice object doesn't restrict it.
A user can always split the port of a device, if the device supports it.

> But this keeps coming back, and since you guys are doing all the work,
> if you really really need it..
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-27 20:42                           ` Jakub Kicinski
@ 2020-03-30  9:07                             ` Parav Pandit
  2020-04-08  6:10                               ` Parav Pandit
  0 siblings, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-03-30  9:07 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: sridhar.samudrala, Aya Levin, andrew.gospodarek, sburla, jiri,
	Tariq Toukan, davem, netdev, Vlad Buslov, lihong.yang,
	Ido Schimmel, jgg, fmanlunas, oss-drivers, leon,
	grygorii.strashko, michael.chan, Alex Vesker, snelson,
	linyunsheng, magnus.karlsson, dchickles, jacob.e.keller,
	Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery, drivers, mlxsw,
	GR-everest-linux-l2, Yevgeny Kliteynik, vikas.gupta,
	Eran Ben Elisha

On 3/28/2020 2:12 AM, Jakub Kicinski wrote:
> On Fri, 27 Mar 2020 19:45:53 +0000 Saeed Mahameed wrote:

>> from what i understand, a real slice is a full isolated HW pipeline
>> with its own HW resources and HW based isolation, a slice rings/hw
>> resources can never be shared between different slices, just like a vf,
>> but without the pcie virtual function back-end..

>> We need a clear-cut definition of what a Sub-function slice is.. this
>> RFC doesn't seem to address that clearly.
> 
> Definitely. I'd say we need a clear definition of (a) what a
> sub-functions is, and (b) what a slice is.
> 

Below is the extended content for Jiri's RFC that addresses Saeed's
point on a clear definition of slice, sub-function and their plumbing.
A few things were already defined by Jiri's RFC, such as the diagrams
and examples, but I am putting it all together below for completeness.

This RFC extension covers overall requirements, design, alternatives
considered and rationale for current design choices to support
sub-functions using slices.

Overview:
---------
A user wants to use multiple netdev and rdma devices of a PCI device
without using PCI SR-IOV. This is done by creating slices of a
PCI device using the devlink tool.

Requirements:
-------------
1. Able to make multiple slices of a device
2. Able to provision such slice to use in-kernel
3. Able to have persistent name of netdev and rdma device of a slice
4. Reuse current power management (suspend/resume) kernel infra for
   slice devices; not reinvent locking etc in each driver or devlink
5. Able to control, steer, offload network traffic of slices using
   existing devlink eswitch switchdev support
6. Not suddenly change the existing pci pf/vf naming scheme by
   introducing slices
7. Support slice of a device in an iommu enabled kernel
8. Dynamically create/delete a slice
9. Ability to configure a slice before deploying it
10. Get/Set slice attributes like PCI VF slice attributes
11. Reuse/extend virtbus for carrying SF slice devices
12. Hot-plug a slice device in host system from the NIC
    (eswitch system) side without running any agent in the
    host system
13. Have unified interface for slice management regardless of
    deploying on eswitch system or on host system.
14. User must be able to create a portion of the device from eswitch
    system and attach to the untrusted host system. This host system
    is not accessible to eswitch system for purpose of device
    life-cycle and initial configuration.

Slice:
------
A slice represents a portion of the device. A slice is a generic
object that represents either a PCI VF or PCI PF or
PCI sub function (SF) described below.

Sub-function (SF):
------------------
- A sub-function is a portion of the PCI device which supports multiple
  classes of devices such as netdev, rdma and more.
- An SF netdev has its own dedicated queues (txq, rxq).
- An SF rdma device has its own QP1, GID table and rdma resources.
  An SF's rdma resources have their own resource namespace.
- An SF supports eswitch representation and full offload support
  when it is running with eswitch support.
- User must configure eswitch to send/receive packets for an SF.
- An SF shares PCI level resources with other SFs and/or with its
  parent PCI function.
  For example, an SF shares IRQ vectors with other SFs and its
  PCI function.
  In future it may have dedicated IRQ vector per SF.
  An SF has dedicated window in PCI BAR space that is not shared
  with other SFs or PF. This ensures that when an SF is assigned to
  an application, only that application can access device resources.

Overall design:
---------------
A new flavour of slice is created that represents a portion of the
device. It is equivalent to a PCI VF slice, but the new slice exists
without PCI SR-IOV. This can scale to possibly more than the number
of SR-IOV VFs. A new slice flavour 'pcisf' is introduced which is
explained later in this document.

devlink subsystem is extended to create, delete and deploy a slice
using a devlink instance.

Slice life cycle is done using devlink commands explained below.

(a) Lifecycle handling consists of 3 main commands, i.e. add, delete
    and state change.
(b) Add/delete commands create or delete a slice of a device,
    respectively.
(c) A slice undergoes one or more configuration steps before it is
    activated. These may include,
     (a) slice hardware address configuration
     (b) representor netdevice configuration
     (c) Network policy, steering configuration through representor
(d) Once a slice is fully configured, the user activates it.
    Slice activation triggers device enumeration and binding to the
    driver.
(e) Each slice goes through state transitions during its life cycle.
(f) Each slice's admin state is controlled by the user. The slice
    operational state is updated based on driver attach/detach events
    on the host system.

    such as,
    admin state            description
    -----------            ------------
    1. inactive    A slice enters this state when it is newly created.
                   The user typically does most of the configuration in
                   this state before activating the slice.

    2. active      State when the slice has just been activated by the
                   user.

    operational            description
       state
    -----------            ------------
    1. attached    State when the slice device is bound to the host
                   driver. When the slice device is unbound from the
                   host driver, it exits this state and enters the
                   detaching state.

    2. detaching   State when the host is notified to deactivate the
                   slice device and the slice device may be undergoing
                   detachment from the host driver. When the slice
                   device is fully detached from the host driver, the
                   slice exits this state and enters the detached
                   state.

    3. detached    State when the driver is fully unbound from the
                   slice device.

slice state machine:
--------------------
                               slice state set inactive
                              ----<------------------<---
                              | or  slice delete        |
                              |                         |
  __________              ____|_______              ____|_______
 |          | slice add  |            |slice state |            |
 |          |-------->---|            |------>-----|            |
 | invalid  |            |  inactive  | set active |   active   |
 |          | slice del  |            |            |            |
 |__________|--<---------|____________|            |____________|

slice device operational state machine:
---------------------------------------
  __________                ____________              ___________
 |          | slice state  |            |driver bus  |           |
 | invalid  |-------->-----|  detached  |------>-----| attached  |
 |          | set active   |            | probe()    |           |
 |__________|              |____________|            |___________|
                                 |                        |
                                 ^                    slice set
                                 |                    set inactive
                            successful detach             |
                              or pf reset                 |
                             ____|_______                 |
                            |            | driver bus     |
                 -----------| detaching  |---<-------------
                 |          |            | remove()
                 ^          |____________|
                 |   timeout      |
                 --<---------------

More on the slice states:
(a) The primary reason to run the slice life cycle this way is that a
    user must be able to create and configure a slice on the eswitch
    system before the host system discovers the slice device. Hence, a
    slice device will not be accessible from the host system until it
    is explicitly activated by the user. Typically this is done after
    all necessary slice attributes are configured.
(b) A user wants to create and configure the slice on the eswitch
    system while the host system is in a powered-down state.
(c) The slice interface for sub-functions should be uniform regardless
    of whether slice devices are deployed on the eswitch system or the
    host system.
(d) If the host system software where a slice is deployed is
    compromised and doesn't detach the slice, the slice remains in
    detaching state until the PF driver is reset on the host. Such a
    slice won't be usable until it is detached gracefully by host
    software. Forcefully changing its state and reusing it can lead to
    unexpected behavior and access of slice resources. Hence, when
    opstate = detaching and state = inactive, such a slice cannot be
    activated.
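The two state machines above, including the rule that a detaching slice
cannot be reactivated, can be sketched as a toy model. Python is used
purely for illustration; the `Slice` class and its method names are
hypothetical, not a proposed kernel or devlink API:

```python
class Slice:
    """Toy model of the slice admin/operational state machine."""

    def __init__(self, sfnum):
        self.sfnum = sfnum
        self.state = "inactive"    # admin state, controlled by the user
        self.opstate = "detached"  # operational state, driven by host driver

    def set_state(self, new):
        if new == "active":
            # A slice whose device is still detaching (e.g. compromised
            # host software never released it) must not be reactivated
            # until the detach completes or the PF driver is reset.
            if self.opstate == "detaching":
                raise RuntimeError("cannot activate: device still detaching")
            self.state = "active"  # triggers device enumeration on the host
        elif new == "inactive":
            if self.opstate == "attached":
                # The host is notified to release the device.
                self.opstate = "detaching"
            self.state = "inactive"
        else:
            raise ValueError(f"unknown admin state: {new}")

    # Events reported from the host system side:
    def on_probe(self):            # virtbus driver probe() succeeded
        self.opstate = "attached"

    def on_remove(self):           # virtbus driver remove() completed,
        self.opstate = "detached"  # or a PF reset forced the detach
```

For example, after `set_state("inactive")` on an attached slice, the
opstate stays "detaching" until `on_remove()` fires, and any attempt to
activate it in between raises an error, matching item (d) above.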

SF sysfs, virtbus and devlink plumbing:
---------------------------------------
(a) Given that a sub-function device may have PCI BAR resource(s), power
    management, IRQ configuration, persistent naming, etc., a clear sysfs
    view like that of an existing PCI device is desired.

    Therefore, an SF resides on a new bus called virtbus.
    This virtbus holds one or more SFs of a parent PCI device.

    Whenever the user activates a slice, the corresponding sub-function
    device is created on the virtbus and attached to the driver.

(b) Each SF has a user-defined unique number associated with it,
    called 'sfnum'. This sfnum is provided at SF slice creation
    time. Multiple uses of the sfnum are explained in detail below.
(c) An active SF slice has a unique 'struct device' anchored on
    virtbus. An SF is identified using unique name on the virtbus.
(d) An SF's device (sysfs) name is created using an ida assigned by the
    virtbus core, such as,
    /sys/bus/virtbus/devices/mlx5_sf.100
    /sys/bus/virtbus/devices/mlx5_sf.2
(e) The scope of an sfnum is within the devlink instance that supports
    SF lifecycle and SF devlink port lifecycle.
    This sfnum is populated in /sys/bus/virtbus/devices/mlx5_sf.100/sfnum
(f) Persistent names of the netdevice and RDMA device of a virtbus SF
    device are prepared using the SF's parent device and the SF's sfnum.
    Such as,
    /sys/bus/virtbus/devices/mlx5_sf.100 ->
../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/100
    Netdev name for such an SF = enp6s0f0s1
    Where <en><parent_dev naming scheme> <s for sf> <sfnum_in_decimal>
    This generates a unique netdev name because the parent is involved
    in the device naming.
    The RDMA device persistent name will be constructed similarly, such
    as 'rocep6s0f0s1'.
(g) SF devlink instance name is prepared using the SF's parent
    bus/device and sfnum, such as pci/0000:06:00.0%sf1.
    This scheme ensures that the SF's devlink device name remains
    predictable using sfnum regardless of the dynamic virtbus device
    name/index.
(h) Each virtbus SF device driver id is defined by virtbus core.
    This driver id is used to bind virtbus SF device with the SF
    driver which has matching driver-id provided during driver
    registration time.
    This is further visible via modpost tools too.
(i) A new devlink eswitch port flavour named 'pcisf' is introduced for
    PCI SF devices similar to existing flavour 'pcivf' for PCI VFs.
(j) devlink eswitch port for PCI SF is added/deleted whenever SF slice
    is added/deleted.
(k) SF representor netdev phys_port_name=pf0sf1
    Format is: pf<pf_number>sf<user_assigned_sfnum>
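
The naming conventions in (f), (g) and (k) above can be sketched with a
few helpers. Python and the helper names are for illustration only; the
formats themselves (the "s<sfnum>" netdev suffix, the "%sf" devlink
separator and the pf<N>sf<M> phys_port_name) come from this proposal:

```python
def sf_netdev_name(parent_netdev: str, sfnum: int) -> str:
    # <en><parent_dev naming scheme><s for sf><sfnum_in_decimal>
    # The parent's predictable name makes the SF netdev name unique.
    return f"{parent_netdev}s{sfnum}"

def sf_devlink_handle(parent_pci_addr: str, sfnum: int) -> str:
    # SF devlink instance name: parent bus/device plus sfnum, so it
    # stays predictable regardless of the dynamic virtbus device index.
    return f"pci/{parent_pci_addr}%sf{sfnum}"

def sf_phys_port_name(pfnum: int, sfnum: int) -> str:
    # Representor netdev phys_port_name: pf<pf_number>sf<sfnum>.
    return f"pf{pfnum}sf{sfnum}"
```

So a parent netdev enp6s0f0 with sfnum 4 yields netdev enp6s0f0s4,
devlink handle pci/0000:06:00.0%sf4 and representor phys_port_name
pf0sf4.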

Examples:
---------
- Add a slice of flavour SF with sf identifier 4 on eswitch system:

$ devlink slice add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 4 index 100

- Show a newly added slice on eswitch system:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detached"
        }
    }
}

- Show eswitch port configuration for SF on eswitch system:
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0 flavour physical port 0
pci/0000:06:00.0/8000: type eth netdev ens2f0sf1 flavour pcisf pfnum 0
sfnum 4
Here eswitch port index 8000 is assigned by the vendor driver.

- Activate SF slice to trigger creating a virtbus SF device in host system:
$ devlink slice set pci/0000:06:00.0/100 state active

- Query SF slice state on eswitch system:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "detached"
        }
    }
}

- Query SF slice state on eswitch system once driver is loaded on SF:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "attached"
        }
    }
}

- Show devlink instances on host system:
$ devlink dev show
pci/0000:06:00.0
pci/0000:06:00.0%sf4

- Show devlink ports on host system:
$ devlink port show
pci/0000:06:00.0/0: flavour physical type eth netdev enp6s0f0
pci/0000:06:00.0%sf4/0: flavour virtual type eth netdev enp6s0f0s4

- Mark a slice inactive, when slice may be in active state:
$ devlink slice set pci/0000:06:00.0/100 state inactive

- Query SF slice state on eswitch system once inactivation is triggered:
Output when it is in detaching state:

$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detaching"
        }
    }
}

Output once detached:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detached"
        }
    }
}

- Delete the recently inactivated slice:
$ devlink slice del pci/0000:06:00.0/100

FAQs:
----
1. What are the differences between PCI VF slice and PCI SF slice?
Ans:
(a) PCI SF slice lifecycle is driven by the user at individual slice
    level.
    PCI VF slice(s) are created and destroyed by the vendor driver
    such as mlx5_core. Typically, this is done when SR-IOV is
    enabled/disabled.
(b) PCI VF slice state cannot be changed by the user, as this is
    currently not supported by the PCI core and vendor devices. VF
    slices are always created in the active state.
    PCI SF slice state is controlled by the user.

2. What are the similarities between PCI VF and PCI SF slice?
(a) Both slices have similar config attributes.
(b) Both slices have eswitch devlink port and representor netdevice.

3. What about handling a VF slice similar to an SF slice?
Ans:
   Yes, this is desired, when the VF vendor device supports this mode.
   There will be at least one difference between SF and VF
   slice handling, i.e. SR-IOV enablement will continue using sysfs on
   the host system. Once SR-IOV is enabled, VF slice commands should
   function in a similar way to SF slice commands.

4. What are the similarities between an SF slice and a PF slice?
Ans:
(a) Both slices have similar config attributes.
(b) Both slices have eswitch devlink port and representor netdevice.

5. Can a slice be used to have a dynamic PF as a slice?
Ans:
   Yes. Whenever a vendor device supports dynamic PFs, the same
   lifecycle, APIs and attributes can be used for a PF slice.

6. Can a slice exist without an eswitch?
Ans: Yes

7. Why not overload devlink port instead of creating a new slice object?
Ans:
   In one implementation, a devlink port could be overloaded/retrofitted
   to achieve what the slice object wants to achieve.
   The same reasoning can be applied to overloading and retrofitting the
   netdevice to achieve what a slice object wants to achieve.
   However, it is more natural to create a new object (slice) that
   represents a device, for the rationale below.
   (a) Even though a devlink slice has a devlink port attached to it,
       it is a narrow model to always require such an association. It
       limits devlink to be used only with an eswitch.
   (b) A slice undergoes state transitions from
       create->config->activate->inactivate->delete.
       It is weird to have some port flavours follow state transitions
       while others don't.

8. Why is a bus needed?
Ans:
(a) To get unique, persistent and deterministic names for the netdev
    and rdma dev of a slice/SF.
(b) Device lifecycle operates using the current driver model, similar
    to the PCI bus. There is no need to invent a new lifecycle scheme.
(c) To follow a uniform device config model for VFs and SFs, where the
    user must be able to configure the attributes before binding a
    VF/SF to a driver.
(d) In future, if needed, a virtbus SF device can be bound to some
    other driver, similar to how a PCI PF/VF device can be bound to
    the mlx5_core or vfio_pci driver.
(e) To reuse the kernel's existing power management framework to
    suspend/resume SF devices.
(f) When using SFs in a smartnic based system, where the SF eswitch
    port and the SF slice are located on two different systems, the
    user desires to configure the eswitch before activating the SF
    slice device.
A bus allows following the existing driver model for the above needs:
create->configure_multiple_attributes->deploy.

9. Why not further extend (or abuse) the mdev bus?
Ans: Because
(a) mdev and vfio are coupled together from an iommu perspective
(b) mdev shouldn't be further abused in its current state,
    as suggested by Greg KH and Jason Gunthorpe.
(c) One bus for "sw_api" purpose, hence the new virtbus.
(d) Some don't like the weird guid/uuid of mdev for the sub-function
    purpose.
(e) If an SF needs to be mapped to a VM using mdev, a new mdev slice
    flavour should be created for lifecycling via devlink, or its life
    cycle may continue via mdev sysfs; but the remaining slice handling
    is done via devlink, similar to how PCI VF slices are managed.

10. Is it ok to reuse the virtbus used for service matching?
Ans: Yes.
Greg KH guided in [1] that it's ok for virtbus to hold devices with
different attributes, such as matching-service devices vs SF devices,
where an SF device will have
(a) PCI BAR resource info
(b) IRQ info
(c) sfnum

Greg KH also guided in [1] that it's ok to anchor the netdev and rdma
dev of an SF device to the SF virtbus device, while the rdma dev and
netdev of a matching-service virtbus device are anchored at the parent
PCI device.

[1] https://www.spinics.net/lists/linux-rdma/msg87124.html

11. Why are the platform and mfd frameworks not used for creating slices?
Ans:
(a) As the platform documentation clearly describes, platform devices
    are for a specific platform. They are autonomous devices, unlike
    user-created slices.
(b) Slices are dynamic in nature, where each individual slice is
    created/destroyed independently.
(c) MFD (multi-function) devices are for a device that comprises
    more than one non-unique yet varying hardware functionality.
    Here, each slice is of the same type as that of the parent device,
    with fewer capabilities than the parent device. Given that mfd
    devices are built on top of platform devices, mfd inherits similar
    limitations as the platform device.
(d) In a few other threads, Greg KH said not to (ab)use platform
    devices for such a purpose.
(e) Given that networking-related slices are linked to an eswitch
    which is managed by devlink, having the lifecycle go through
    devlink confines locking/synchronization to a single subsystem.

12. A given slice supports multiple classes of devices, such as RDMA,
   netdev and vdpa devices. How can I disable/enable such attributes
   of a slice?
Ans:
   Slice attributes should be extended in vendor-neutral and also
   vendor-specific ways to enable/disable such attributes, depending
   on the attribute type. This should be handled in future patches.

virtbus limitations:
--------------------
1. The virtbus will not support IOMMU the way the pci bus does.
   Hence, all protocol devices (such as netdev, rdmadev) must use
   their parent's DMA device.
   The RDMA subsystem will use ib_device->dma_device with the existing
   ib_dma_ wrappers.
   netdev will use mlx5_core_dev->pci_dev for dma purposes.
   This is suggested/hinted by Christoph and Jason.

   Why is it this way?
   (a) Because currently only rdma and netdev intend to use it.
   (b) In future, if more use cases arise, a virtbus device can share
       the DMAR and group of its parent PCI device for in-kernel
       usecases.

2. Dedicated or shared irq vector(s) assignment per sub-function and
   its exposure in sysfs will not be supported in the initial version.
   It will be supported in a future series.

3. PCI BAR resource information as resource files in sysfs will not be
   supported in the initial version.
   It will be supported in a future series.

Post initial series:
--------------------
The following plumbing will be done after the initial series.

1. Max number of SF slices resource at devlink level
2. Max IRQ vectors resource config at devlink level
3. sf lifecycle for kernel vDPA support, still under discussion
4. systemd/udev user space patches for persistent naming for
   netdev and rdma

Out-of-scope:
-------------
1. Creating mdev/vhost backend/vdpa devices using devlink slice APIs.
   To support vhost backend 'struct device' creation, the SF slice
   create/set API can be extended to enable/disable specific
   capabilities of the slice, such as enabling/disabling the kernel
   vDPA feature or the RDMA device for an SF slice.
2. A similar extension is applicable to a VF slice.
3. Multi-host support is orthogonal to this and can be extended
   in future.

Example software/system view:
-----------------------------
       _______
      | admin |
      | user  |----------
      |_______|         |
          |             |
      ____|____       __|______            _____________
     |         |     |         |          |             |
     | devlink |     |   ovs   |          |    user     |
     | tool    |     |_________|          | application |
     |_________|         |                |_____________|
           |             |                   |       |
-----------|-------------|-------------------|-------|-----------
           |             |           +----------+   +----------+
           |             |           |  netdev  |   | rdma dev |
           |             |           +----------+   +----------+
      (slice cmds,       |              ^             ^
       add/del/set)      |              |             |
           |             |              +-------------|
      _____|___          |              |         ____|________
     |         |         |              |        |             |
     | devlink |   +------------+       |        | mlx5_core/ib|
     | kernel  |   | rep netdev |       |        | drivers     |
     |_________|   +------------+       |        |_____________|
           |             |              |             ^
     (slice cmds)        |              |        (probe/remove)
      _____|____         |              |         ____|_____
     |          |        |    +--------------+   |          |
     | mlx5_core|---------    | virtbus dev  |---|  virtbus |
     | driver   |             +--------------+   |  driver  |
     |__________|                                |__________|
           |                                          ^
      (sf add/del, events)                            |
           |                                   (device add/del)
      _____|____                                  ____|_____
     |          |                                |          |
      |  PCI NIC |---- admin activate/deactivate  | mlx5_core|
      |__________|         deactivate events ---->| driver   |
                                                 |__________|



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-03-30  7:48                     ` Parav Pandit
@ 2020-03-30 19:36                       ` Jakub Kicinski
  2020-03-31  7:45                         ` Parav Pandit
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-30 19:36 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On Mon, 30 Mar 2020 07:48:39 +0000 Parav Pandit wrote:
> On 3/27/2020 10:08 PM, Jakub Kicinski wrote:
> > On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:  
> >>> So the queues, interrupts, and other resources are also part 
> >>> of the slice then?    
> >>
> >> Yep, that seems to make sense.
> >>  
> >>> How do slice parameters like rate apply to NVMe?    
> >>
> >> Not really.
> >>  
> >>> Are ports always ethernet? and slices also cover endpoints with
> >>> transport stack offloaded to the NIC?    
> >>
> >> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
> >> there can be port type "nve" which would contain only some of the
> >> config options and would not have a representor "netdev/ibdev" linked.
> >> I don't know.  
> > 
> > I honestly find it hard to understand what that slice abstraction is,
> > and which things belong to slices and which to PCI ports (or why we even
> > have them).
> >   
> In an alternative, devlink port can be overloaded/retrofit to do all
> things that slice desires to do.

I wouldn't say retrofitted; in my mind a port has always been a port
of a device.

Jiri explained to me that to Mellanox a port is a port of an eswitch,
not a port of a device. While to me (/Netronome) it was any way to
send or receive data to/from the device.

Now I understand why to you nvme doesn't fit the port abstraction.

> For that matter representor netdev can be overloaded/extended to do what
> slice desire to do (instead of devlink port).

Right, in my mental model representor _is_ a port of the eswitch, so
repr would not make sense to me.

> Can you please explain why you think devlink port should be overloaded
> instead of netdev or any other kernel object?
> Do you have an example of such overloaded functionality of a kernel object?
> Like why macvlan and vlan drivers are not combined to in single driver
> object? Why teaming and bonding driver are combined in single driver
> object?...

I think it's not overloading, but the fact that we started with
different definitions. We (me and you) tried adding the PCIe ports
around the same time, I guess we should have dug into the details
right away.

> User should be able to create, configure, deploy, delete a 'portion of
> the device' with/without eswitch.

Right, to me ports are of the device, not eswitch.

> We shouldn't be starting with restrictive/narrow view of devlink port.
> 
> Internally with Jiri and others, we also explored the possibility to
> have 'mgmtvf', 'mgmtpf',  'mgmtsf' port flavours by overloading port to
> do all things as that of slice.
> It wasn't elegant enough. Why not create right object?

We just need clear definitions of what goes where. We already have
params etc. hanging off the ports, including irq/sriov stuff. But in
the slice model those don't belong there :S

In fact very little belongs to the port in that model. So why have
PCI ports in the first place?

> Additionally devlink port object doesn't go through the same state
> machine as that what slice has to go through.
> So its weird that some devlink port has state machine and some doesn't.

You mean for VFs? I think you can add the states to the API.

> > With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
> > would have made sense to have a slice that covers multiple ports, but
> > it seems the proposal is to have port to slice mapping be 1:1. And rate
> > in those devices should still be per port not per slice.
> >   
> Slice can have multiple ports. slice object doesn't restrict it. User
> can always split the port for a device, if device support it.

Okay, so slices are not 1:1 with ports, then?  Is it any:any?

> > But this keeps coming back, and since you guys are doing all the work,
> > if you really really need it..

* Re: [RFC] current devlink extension plan for NICs
  2020-03-30 19:36                       ` Jakub Kicinski
@ 2020-03-31  7:45                         ` Parav Pandit
  2020-03-31 17:32                           ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-03-31  7:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On 3/31/2020 1:06 AM, Jakub Kicinski wrote:
> On Mon, 30 Mar 2020 07:48:39 +0000 Parav Pandit wrote:
>> On 3/27/2020 10:08 PM, Jakub Kicinski wrote:
>>> On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:  
>>>>> So the queues, interrupts, and other resources are also part 
>>>>> of the slice then?    
>>>>
>>>> Yep, that seems to make sense.
>>>>  
>>>>> How do slice parameters like rate apply to NVMe?    
>>>>
>>>> Not really.
>>>>  
>>>>> Are ports always ethernet? and slices also cover endpoints with
>>>>> transport stack offloaded to the NIC?    
>>>>
>>>> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
>>>> there can be port type "nve" which would contain only some of the
>>>> config options and would not have a representor "netdev/ibdev" linked.
>>>> I don't know.  
>>>
>>> I honestly find it hard to understand what that slice abstraction is,
>>> and which things belong to slices and which to PCI ports (or why we even
>>> have them).
>>>   
>> In an alternative, devlink port can be overloaded/retrofit to do all
>> things that slice desires to do.
> 
> I wouldn't say retrofitted, in my mind port has always been a port of 
> a device.
> 
But here a networking device is getting created on the host system,
with a connection to an eswitch port.

> Jiri explained to me that to Mellanox port is port of a eswitch, not
> port of a device. While to me (/Netronome) it was any way to send or
> receive data to/from the device.
> 
> Now I understand why to you nvme doesn't fit the port abstraction.
> 
ok. Great.

>> For that matter representor netdev can be overloaded/extended to do what
>> slice desire to do (instead of devlink port).
> 
> Right, in my mental model representor _is_ a port of the eswitch, so
> repr would not make sense to me.
>
Right. So the eswitch devlink port (pcipf, pcivf) flavours are also
not the right object to use, as they represent the eswitch side.

So either we create a new host-facing devlink port flavour and run
the state machine for those devlink ports, or we create a more
refined object, the slice, and anchor things there.

>> Can you please explain why you think devlink port should be overloaded
>> instead of netdev or any other kernel object?
>> Do you have an example of such overloaded functionality of a kernel object?
>> Like why macvlan and vlan drivers are not combined to in single driver
>> object? Why teaming and bonding driver are combined in single driver
>> object?...
> 
> I think it's not overloading, but the fact that we started with
> different definitions. We (me and you) tried adding the PCIe ports
> around the same time, I guess we should have dug into the details
> right away.
>
Yes. :-)

>> User should be able to create, configure, deploy, delete a 'portion of
>> the device' with/without eswitch.
> 
> Right, to me ports are of the device, not eswitch.
> 
True. We are aligned here.

>> We shouldn't be starting with restrictive/narrow view of devlink port.
>>
>> Internally with Jiri and others, we also explored the possibility to
>> have 'mgmtvf', 'mgmtpf',  'mgmtsf' port flavours by overloading port to
>> do all things as that of slice.
>> It wasn't elegant enough. Why not create right object?
> 
> We just need clear definitions of what goes where. 
Yes.
The proposal is straightforward here, that is:
(a) if a user wants to control/monitor params of a PF/VF/SF which
are facing that particular function (PF/VF/SF), such as mac, irq,
num_qs, the state machine etc.,
those are anchored at the slice ('portion of the device') level.
I detailed the whole plumbing in the extended RFC content in the
thread yesterday.

(b) if a user wants to control/monitor params which are towards the
eswitch level, they are done either through the representor netdev
or through the eswitch-side devlink port.
For example, an eswitch PCI VF's internal flow table should be
exposed via dpipe linked to the eswitch devlink port.
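The (a)/(b) split above can be modeled roughly as a routing rule from
attribute to anchoring object. This is a sketch only; the attribute
names below are illustrative, not a proposed API:

```python
# Hedged sketch of the proposed anchoring rule: host-facing function
# attributes anchor at the slice, eswitch-facing ones at the eswitch
# devlink port / representor. Attribute names are assumptions.

SLICE_ATTRS = {"hw_addr", "irq", "num_qs", "state"}
ESWITCH_PORT_ATTRS = {"dpipe_tables", "repr_config"}

def anchor_for(attr):
    """Return which object a given attribute should hang off."""
    if attr in SLICE_ATTRS:
        return "slice"
    if attr in ESWITCH_PORT_ATTRS:
        return "eswitch_port"
    raise KeyError(f"unknown attribute: {attr}")

assert anchor_for("hw_addr") == "slice"
assert anchor_for("dpipe_tables") == "eswitch_port"
```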


> We already have
> params etc. hanging off the ports, including irq/sriov stuff. But in
> slice model those don't belong there :S
> 
I looked at DaveM's net-next tree today.
The only driver that uses devlink port params is bnxt, and even that
driver registers an empty array of port parameters.
The sriov/irq stuff currently hangs off the devlink device level for
its own device.
Can you please provide a link to code that uses devlink port params?

> In fact very little belongs to the port in that model. So why have
> PCI ports in the first place?
>
For a few reasons:
1. PCI ports establish the relationship between an eswitch port and
its representor netdevice.
Relying on the plain netdev name doesn't work in certain PCI
topologies where the netdev name would exceed 15 characters.
2. Health reporters can be at the port level.
3. In the future, at the eswitch PCI port, I will be adding dpipe
support for the internal flow tables maintained by the driver.
4. There were inconsistencies among vendor drivers in using/abusing
phys_port_name of the eswitch ports. This is consolidated via the
devlink port in the core, which provides a consistent view among all
vendor drivers.

So PCI eswitch-side ports are useful regardless of slices.
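The 15-character limit behind point 1 (IFNAMSIZ - 1 in the kernel)
can be illustrated with a rough Python model; the representor naming
scheme below is an illustration, not the actual udev convention:

```python
# Why name-based repr <-> eswitch-port mapping breaks: Linux netdev
# names are limited to IFNAMSIZ-1 = 15 characters, so names derived
# from a deep PCI topology can overflow and get truncated.

IFNAMSIZ = 16  # kernel limit, including the terminating NUL

def rep_name(pf_netdev, pf_index, vf_index):
    # Hypothetical "enp6s0f0npf0vf10"-style naming scheme.
    name = f"{pf_netdev}npf{pf_index}vf{vf_index}"
    return name[:IFNAMSIZ - 1]  # kernel rejects/truncates longer names

full = "enp6s0f0" + "npf0vf10"
assert len(full) == 16                       # exceeds the 15-char limit
assert rep_name("enp6s0f0", 0, 10) != full   # truncated, mapping lost
```

Encoding the relationship in the devlink port object avoids relying
on the (truncatable) name entirely.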

>> Additionally devlink port object doesn't go through the same state
>> machine as that what slice has to go through.
>> So its weird that some devlink port has state machine and some doesn't.
> 
> You mean for VFs? I think you can add the states to the API.
> 
As we agreed above that eswitch-side objects (devlink port and
representor netdev) should not be used for a 'portion of the device',

we certainly need to create either
(a) new devlink ports with host-facing flavour(s) and run the state
machine for them,
or
(b) a new devlink slice object that represents the 'portion of the
device'.

We can add the state machine to the port. However, it suffers from
the issue that certain flavours, such as physical, dsa and eswitch
ports, have no notion of a state machine, attachment to a driver,
etc.

This is where I find that we are overloading the port beyond its
current definition, and the extensions don't seem applicable to
those ports in the future.

A 'portion of the device' as an individual object that can
optionally be linked to an eswitch port makes more sense (like how a
devlink port optionally links to a representor).
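The lifecycle argument above can be sketched as a small state
machine. The state names are assumptions inferred from the thread,
not the final API:

```python
# Illustrative sketch of a slice lifecycle; state names are assumed
# (the point is that slices need a state machine, while physical/dsa
# port flavours have no such notion).

ALLOWED = {
    ("allocated", "configured"),
    ("configured", "activated"),
    ("activated", "active"),       # device/driver attach completes
    ("active", "deactivated"),
    ("deactivated", "allocated"),  # can be reconfigured and reused
}

class Slice:
    def __init__(self):
        self.state = "allocated"

    def transition(self, new_state):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal: {self.state} -> {new_state}")
        self.state = new_state

s = Slice()
s.transition("configured")
s.transition("activated")
s.transition("active")
```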

>>> With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
>>> would have made sense to have a slice that covers multiple ports, but
>>> it seems the proposal is to have port to slice mapping be 1:1. And rate
>>> in those devices should still be per port not per slice.
>>>   
>> Slice can have multiple ports. slice object doesn't restrict it. User
>> can always split the port for a device, if device support it.
> 
> Okay, so slices are not 1:1 with ports, then?  Is it any:any?
> 
A slice can attach to one or more eswitch ports, if the slice wants
to support eswitch offloads etc.

A slice without an eswitch can have zero eswitch ports linked to it.
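The slice-to-port relationship just described (zero or more, not 1:1)
can be sketched as follows; the class shapes are illustrative only:

```python
# Sketch of the slice <-> eswitch-port relationship: a slice
# optionally links to zero or more eswitch ports.

class EswitchPort:
    def __init__(self, index):
        self.index = index

class Slice:
    def __init__(self):
        self.eswitch_ports = []   # zero ports: slice without eswitch

    def link(self, port):
        self.eswitch_ports.append(port)

plain = Slice()                   # e.g. an NVMe-style slice, no eswitch
assert len(plain.eswitch_ports) == 0

offloaded = Slice()
offloaded.link(EswitchPort(0))
offloaded.link(EswitchPort(1))    # multiple ports on one slice
assert len(offloaded.eswitch_ports) == 2
```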

* Re: [RFC] current devlink extension plan for NICs
  2020-03-31  7:45                         ` Parav Pandit
@ 2020-03-31 17:32                           ` Jakub Kicinski
  2020-04-01  7:32                             ` Parav Pandit
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-31 17:32 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On Tue, 31 Mar 2020 07:45:51 +0000 Parav Pandit wrote:
> > In fact very little belongs to the port in that model. So why have
> > PCI ports in the first place?
> >  
> for few reasons.
> 1. PCI ports are establishing the relationship between eswitch port and
> its representor netdevice.
> Relying on plain netdev name doesn't work in certain pci topology where
> netdev name exceeds 15 characters.
> 2. health reporters can be at port level.

Why? The health reporters we have not AFAIK are for FW and for queues
hanging. Aren't queues on the slice and FW on the device?

> 3. In future at eswitch pci port, I will be adding dpipe support for the
> internal flow tables done by the driver.
> 4. There were inconsistency among vendor drivers in using/abusing
> phys_port_name of the eswitch ports. This is consolidated via devlink
> port in core. This provides consistent view among all vendor drivers.
> 
> So PCI eswitch side ports are useful regardless of slice.
> 
> >> Additionally devlink port object doesn't go through the same state
> >> machine as that what slice has to go through.
> >> So its weird that some devlink port has state machine and some doesn't.  
> > 
> > You mean for VFs? I think you can add the states to the API.
> >   
> As we agreed above that eswitch side objects (devlink port and
> representor netdev) should not be used for 'portion of device',

We haven't agreed, I just explained how we differ.

* RE: [RFC] current devlink extension plan for NICs
  2020-03-31 17:32                           ` Jakub Kicinski
@ 2020-04-01  7:32                             ` Parav Pandit
  2020-04-01 20:12                               ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-04-01  7:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, March 31, 2020 11:03 PM
> 
> On Tue, 31 Mar 2020 07:45:51 +0000 Parav Pandit wrote:
> > > In fact very little belongs to the port in that model. So why have
> > > PCI ports in the first place?
> > >
> > for few reasons.
> > 1. PCI ports are establishing the relationship between eswitch port
> > and its representor netdevice.
> > Relying on plain netdev name doesn't work in certain pci topology
> > where netdev name exceeds 15 characters.
> > 2. health reporters can be at port level.
> 
> Why? The health reporters we have not AFAIK are for FW and for queues
> hanging. Aren't queues on the slice and FW on the device?
There are multiple health reporters per object.
There are per-queue health reporters on the representor queues (and representors are attached to a devlink port).
Can someone have a representor netdev for an eswitch port without a devlink port?
No, net/core/devlink.c cross-verifies this and does a WARN_ON.
So the devlink ports for the eswitch are linked to representors and are needed.
Their existence is not a replacement for representing a 'portion of the device'.
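The core invariant cited here (a representor netdev must be backed by
a devlink port) can be modeled in a simplified userspace sketch; this
is not the actual net/core/devlink.c code:

```python
# Simplified model of the check in net/core/devlink.c: registering a
# representor netdev without a backing devlink port triggers a WARN_ON
# (modeled here as an exception).

class DevlinkPort:
    pass

def register_repr_netdev(devlink_port):
    if devlink_port is None:
        raise RuntimeError("WARN_ON: representor without devlink port")
    return {"devlink_port": devlink_port}

rep = register_repr_netdev(DevlinkPort())
assert rep["devlink_port"] is not None

try:
    register_repr_netdev(None)
except RuntimeError:
    pass  # the invariant holds
else:
    assert False, "missing devlink port must be rejected"
```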

> 
> > 3. In future at eswitch pci port, I will be adding dpipe support for
> > the internal flow tables done by the driver.
> > 4. There were inconsistency among vendor drivers in using/abusing
> > phys_port_name of the eswitch ports. This is consolidated via devlink
> > port in core. This provides consistent view among all vendor drivers.
> >
> > So PCI eswitch side ports are useful regardless of slice.
> >
> > >> Additionally devlink port object doesn't go through the same state
> > >> machine as that what slice has to go through.
> > >> So its weird that some devlink port has state machine and some doesn't.
> > >
> > > You mean for VFs? I think you can add the states to the API.
> > >
> > As we agreed above that eswitch side objects (devlink port and
> > representor netdev) should not be used for 'portion of device',
> 
> We haven't agreed, I just explained how we differ.

You mentioned that "Right, in my mental model representor _is_ a port of the eswitch, so repr would not make sense to me."

With that I infer that any object that is directly and _always_ linked to the eswitch and represents an eswitch port is out of the question; this includes the devlink port of the eswitch and the netdev representor.
Hence the comment 'we agree conceptually' to not involve the devlink port of the eswitch and the representor netdev in representing the 'portion of the device'.

* Re: [RFC] current devlink extension plan for NICs
  2020-04-01  7:32                             ` Parav Pandit
@ 2020-04-01 20:12                               ` Jakub Kicinski
  2020-04-02  6:16                                 ` Jiri Pirko
  2020-04-08  5:07                                 ` Parav Pandit
  0 siblings, 2 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-04-01 20:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On Wed, 1 Apr 2020 07:32:46 +0000 Parav Pandit wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Tuesday, March 31, 2020 11:03 PM
> > 
> > On Tue, 31 Mar 2020 07:45:51 +0000 Parav Pandit wrote:  
> > > > In fact very little belongs to the port in that model. So why have
> > > > PCI ports in the first place?
> > > >  
> > > for few reasons.
> > > 1. PCI ports are establishing the relationship between eswitch port
> > > and its representor netdevice.
> > > Relying on plain netdev name doesn't work in certain pci topology
> > > where netdev name exceeds 15 characters.
> > > 2. health reporters can be at port level.  
> > 
> > Why? The health reporters we have not AFAIK are for FW and for queues
> > hanging. Aren't queues on the slice and FW on the device?  
> There are multiple heath reporters per object.
> There are per q health reporters on the representor queues (and
> representors are attached to devlink port). Can someone can have
> representor netdev for an eswitch port without devlink port? No,
> net/core/devlink.c cross verify this and do WARN_ON. So devlink port
> for eswitch are linked to representors and are needed. Their
> existence is not a replacement for representing 'portion of the
> device'.

I don't understand what you're trying to say. My question was why are
queues not on the "slice"? If PCIe resources are on the slice, then so
should be the health reporters.

> > > 3. In future at eswitch pci port, I will be adding dpipe support
> > > for the internal flow tables done by the driver.
> > > 4. There were inconsistency among vendor drivers in using/abusing
> > > phys_port_name of the eswitch ports. This is consolidated via
> > > devlink port in core. This provides consistent view among all
> > > vendor drivers.
> > >
> > > So PCI eswitch side ports are useful regardless of slice.
> > >  
> > > >> Additionally devlink port object doesn't go through the same
> > > >> state machine as that what slice has to go through.
> > > >> So its weird that some devlink port has state machine and some
> > > >> doesn't.  
> > > >
> > > > You mean for VFs? I think you can add the states to the API.
> > > >  
> > > As we agreed above that eswitch side objects (devlink port and
> > > representor netdev) should not be used for 'portion of device',  
> > 
> > We haven't agreed, I just explained how we differ.  
> 
> You mentioned that " Right, in my mental model representor _is_ a
> port of the eswitch, so repr would not make sense to me."
> 
> With that I infer that 'any object that is directly and _always_
> linked to eswitch and represents an eswitch port is out of question,
> this includes devlink port of eswitch and netdev representor. Hence,
> the comment 'we agree conceptually' to not involve devlink port of
> eswitch and representor netdev to represent 'portion of the device'.

I disagree, repr is one to one with eswitch port. Just because
repr is associated with a devlink port doesn't mean devlink port 
must be associated with a repr or a netdev. 

* Re: [RFC] current devlink extension plan for NICs
  2020-04-01 20:12                               ` Jakub Kicinski
@ 2020-04-02  6:16                                 ` Jiri Pirko
  2020-04-08  5:10                                   ` Parav Pandit
  2020-04-08  5:07                                 ` Parav Pandit
  1 sibling, 1 reply; 50+ messages in thread
From: Jiri Pirko @ 2020-04-02  6:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Parav Pandit, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

Wed, Apr 01, 2020 at 10:12:31PM CEST, kuba@kernel.org wrote:
>On Wed, 1 Apr 2020 07:32:46 +0000 Parav Pandit wrote:
>> > From: Jakub Kicinski <kuba@kernel.org>
>> > Sent: Tuesday, March 31, 2020 11:03 PM
>> > 
>> > On Tue, 31 Mar 2020 07:45:51 +0000 Parav Pandit wrote:  
>> > > > In fact very little belongs to the port in that model. So why have
>> > > > PCI ports in the first place?
>> > > >  
>> > > for few reasons.
>> > > 1. PCI ports are establishing the relationship between eswitch port
>> > > and its representor netdevice.
>> > > Relying on plain netdev name doesn't work in certain pci topology
>> > > where netdev name exceeds 15 characters.
>> > > 2. health reporters can be at port level.  
>> > 
>> > Why? The health reporters we have not AFAIK are for FW and for queues
>> > hanging. Aren't queues on the slice and FW on the device?  
>> There are multiple heath reporters per object.
>> There are per q health reporters on the representor queues (and
>> representors are attached to devlink port). Can someone can have
>> representor netdev for an eswitch port without devlink port? No,
>> net/core/devlink.c cross verify this and do WARN_ON. So devlink port
>> for eswitch are linked to representors and are needed. Their
>> existence is not a replacement for representing 'portion of the
>> device'.
>
>I don't understand what you're trying to say. My question was why are
>queues not on the "slice"? If PCIe resources are on the slice, then so
>should be the health reporters.

I agree. These should be attached to the slice.


>
>> > > 3. In future at eswitch pci port, I will be adding dpipe support
>> > > for the internal flow tables done by the driver.
>> > > 4. There were inconsistency among vendor drivers in using/abusing
>> > > phys_port_name of the eswitch ports. This is consolidated via
>> > > devlink port in core. This provides consistent view among all
>> > > vendor drivers.
>> > >
>> > > So PCI eswitch side ports are useful regardless of slice.
>> > >  
>> > > >> Additionally devlink port object doesn't go through the same
>> > > >> state machine as that what slice has to go through.
>> > > >> So its weird that some devlink port has state machine and some
>> > > >> doesn't.  
>> > > >
>> > > > You mean for VFs? I think you can add the states to the API.
>> > > >  
>> > > As we agreed above that eswitch side objects (devlink port and
>> > > representor netdev) should not be used for 'portion of device',  
>> > 
>> > We haven't agreed, I just explained how we differ.  
>> 
>> You mentioned that " Right, in my mental model representor _is_ a
>> port of the eswitch, so repr would not make sense to me."
>> 
>> With that I infer that 'any object that is directly and _always_
>> linked to eswitch and represents an eswitch port is out of question,
>> this includes devlink port of eswitch and netdev representor. Hence,
>> the comment 'we agree conceptually' to not involve devlink port of
>> eswitch and representor netdev to represent 'portion of the device'.
>
>I disagree, repr is one to one with eswitch port. Just because
>repr is associated with a devlink port doesn't mean devlink port 
>must be associated with a repr or a netdev. 

* RE: [RFC] current devlink extension plan for NICs
  2020-04-01 20:12                               ` Jakub Kicinski
  2020-04-02  6:16                                 ` Jiri Pirko
@ 2020-04-08  5:07                                 ` Parav Pandit
  2020-04-08 16:59                                   ` Jakub Kicinski
  1 sibling, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-04-08  5:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson



> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Jakub Kicinski
> 
> On Wed, 1 Apr 2020 07:32:46 +0000 Parav Pandit wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Tuesday, March 31, 2020 11:03 PM
> > >
> > > On Tue, 31 Mar 2020 07:45:51 +0000 Parav Pandit wrote:
> > > > > In fact very little belongs to the port in that model. So why
> > > > > have PCI ports in the first place?
> > > > >
> > > > for few reasons.
> > > > 1. PCI ports are establishing the relationship between eswitch
> > > > port and its representor netdevice.
> > > > Relying on plain netdev name doesn't work in certain pci topology
> > > > where netdev name exceeds 15 characters.
> > > > 2. health reporters can be at port level.
> > >
> > > Why? The health reporters we have not AFAIK are for FW and for
> > > queues hanging. Aren't queues on the slice and FW on the device?
> > There are multiple heath reporters per object.
> > There are per q health reporters on the representor queues (and
> > representors are attached to devlink port). Can someone can have
> > representor netdev for an eswitch port without devlink port? No,
> > net/core/devlink.c cross verify this and do WARN_ON. So devlink port
> > for eswitch are linked to representors and are needed. Their existence
> > is not a replacement for representing 'portion of the device'.
> 
> I don't understand what you're trying to say. My question was why are
> queues not on the "slice"? If PCIe resources are on the slice, then so should
> be the health reporters.
> 

> > > > 3. In future at eswitch pci port, I will be adding dpipe support
> > > > for the internal flow tables done by the driver.
> > > > 4. There were inconsistency among vendor drivers in using/abusing
> > > > phys_port_name of the eswitch ports. This is consolidated via
> > > > devlink port in core. This provides consistent view among all
> > > > vendor drivers.
> > > >
> > > > So PCI eswitch side ports are useful regardless of slice.
> > > >
> > > > >> Additionally devlink port object doesn't go through the same
> > > > >> state machine as that what slice has to go through.
> > > > >> So its weird that some devlink port has state machine and some
> > > > >> doesn't.
> > > > >
> > > > > You mean for VFs? I think you can add the states to the API.
> > > > >
> > > > As we agreed above that eswitch side objects (devlink port and
> > > > representor netdev) should not be used for 'portion of device',
> > >
> > > We haven't agreed, I just explained how we differ.
> >
> > You mentioned that " Right, in my mental model representor _is_ a port
> > of the eswitch, so repr would not make sense to me."
> >
> > With that I infer that 'any object that is directly and _always_
> > linked to eswitch and represents an eswitch port is out of question,
> > this includes devlink port of eswitch and netdev representor. Hence,
> > the comment 'we agree conceptually' to not involve devlink port of
> > eswitch and representor netdev to represent 'portion of the device'.
> 
> I disagree, repr is one to one with eswitch port. Just because repr is
> associated with a devlink port doesn't mean devlink port must be associated
> with a repr or a netdev.
Devlink port which is on eswitch side is registered with switch_id and also linked to the rep netdev.
From this port phys_port_name is derived.
This eswitch port shouldn't represent 'portion of the device'.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [RFC] current devlink extension plan for NICs
  2020-04-02  6:16                                 ` Jiri Pirko
@ 2020-04-08  5:10                                   ` Parav Pandit
  0 siblings, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-04-08  5:10 UTC (permalink / raw)
  To: Jiri Pirko, Jakub Kicinski
  Cc: netdev, davem, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson



> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Thursday, April 2, 2020 11:47 AM
> 
> Wed, Apr 01, 2020 at 10:12:31PM CEST, kuba@kernel.org wrote:
> >On Wed, 1 Apr 2020 07:32:46 +0000 Parav Pandit wrote:
> >> > From: Jakub Kicinski <kuba@kernel.org>
> >> > Sent: Tuesday, March 31, 2020 11:03 PM
> >> >
> >> > On Tue, 31 Mar 2020 07:45:51 +0000 Parav Pandit wrote:
> >> > > > In fact very little belongs to the port in that model. So why
> >> > > > have PCI ports in the first place?
> >> > > >
> >> > > for few reasons.
> >> > > 1. PCI ports are establishing the relationship between eswitch
> >> > > port and its representor netdevice.
> >> > > Relying on plain netdev name doesn't work in certain pci topology
> >> > > where netdev name exceeds 15 characters.
> >> > > 2. health reporters can be at port level.
> >> >
> >> > Why? The health reporters we have not AFAIK are for FW and for
> >> > queues hanging. Aren't queues on the slice and FW on the device?
> >> There are multiple heath reporters per object.
> >> There are per q health reporters on the representor queues (and
> >> representors are attached to devlink port). Can someone can have
> >> representor netdev for an eswitch port without devlink port? No,
> >> net/core/devlink.c cross verify this and do WARN_ON. So devlink port
> >> for eswitch are linked to representors and are needed. Their
> >> existence is not a replacement for representing 'portion of the
> >> device'.
> >
> >I don't understand what you're trying to say. My question was why are
> >queues not on the "slice"? If PCIe resources are on the slice, then so
> >should be the health reporters.
> 
> I agree. These should be attached to the slice.
>
Representor netdev has txq and rxq.
Health reporters for this queue are attached to the txq and rxq.

Txq/rxq related health reporters can be linked to a slice, if that is what you meant.
 
> 
> >
> >> > > 3. In future at eswitch pci port, I will be adding dpipe support
> >> > > for the internal flow tables done by the driver.
> >> > > 4. There were inconsistency among vendor drivers in using/abusing
> >> > > phys_port_name of the eswitch ports. This is consolidated via
> >> > > devlink port in core. This provides consistent view among all
> >> > > vendor drivers.
> >> > >
> >> > > So PCI eswitch side ports are useful regardless of slice.
> >> > >
> >> > > >> Additionally devlink port object doesn't go through the same
> >> > > >> state machine as that what slice has to go through.
> >> > > >> So its weird that some devlink port has state machine and some
> >> > > >> doesn't.
> >> > > >
> >> > > > You mean for VFs? I think you can add the states to the API.
> >> > > >
> >> > > As we agreed above that eswitch side objects (devlink port and
> >> > > representor netdev) should not be used for 'portion of device',
> >> >
> >> > We haven't agreed, I just explained how we differ.
> >>
> >> You mentioned that " Right, in my mental model representor _is_ a
> >> port of the eswitch, so repr would not make sense to me."
> >>
> >> With that I infer that 'any object that is directly and _always_
> >> linked to eswitch and represents an eswitch port is out of question,
> >> this includes devlink port of eswitch and netdev representor. Hence,
> >> the comment 'we agree conceptually' to not involve devlink port of
> >> eswitch and representor netdev to represent 'portion of the device'.
> >
> >I disagree, repr is one to one with eswitch port. Just because repr is
> >associated with a devlink port doesn't mean devlink port must be
> >associated with a repr or a netdev.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [RFC] current devlink extension plan for NICs
  2020-03-30  9:07                             ` Parav Pandit
@ 2020-04-08  6:10                               ` Parav Pandit
  0 siblings, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-04-08  6:10 UTC (permalink / raw)
  To: Parav Pandit, Jakub Kicinski, Saeed Mahameed
  Cc: sridhar.samudrala, Aya Levin, andrew.gospodarek, sburla, jiri,
	Tariq Toukan, davem, netdev, Vlad Buslov, lihong.yang,
	Ido Schimmel, jgg, fmanlunas, oss-drivers, leon,
	grygorii.strashko, michael.chan, Alex Vesker, snelson,
	linyunsheng, magnus.karlsson, dchickles, jacob.e.keller,
	Moshe Shemesh, Mark Zhang, aelior, Yuval Avnery, drivers, mlxsw,
	GR-everest-linux-l2, Yevgeny Kliteynik, vikas.gupta,
	Eran Ben Elisha

Hi Saeed,

> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Parav Pandit
> On 3/28/2020 2:12 AM, Jakub Kicinski wrote:
> > On Fri, 27 Mar 2020 19:45:53 +0000 Saeed Mahameed wrote:
> 
> >> from what i understand, a real slice is a full isolated HW pipeline
> >> with its own HW resources and HW based isolation, a slice rings/hw
> >> resources can never be shared between different slices, just like a
> >> vf, but without the pcie virtual function back-end..
> 
> >> We need a clear-cut definition of what a Sub-function slice is.. this
> >> RFC doesn't seem to address that clearly.
> >

Did you get a chance to review below content?
Can you please confirm if below description describes, what you were looking for?



> > Definitely. I'd say we need a clear definition of (a) what a
> > sub-functions is, and (b) what a slice is.
> >
> 
> Below is the extended contents for Jiri's RFC that addresses Saeed point on
> clear definition of slice, sub-function and their plumbing.
> Few things already defined by Jiri's RFC such as diagrams, examples, but
> putting it all together below for completeness.
> 

> This RFC extension covers overall requirements, design, alternatives
> considered and rationale for current design choices to support sub-functions
> using slices.
> 
> Overview:
> ---------
> A user wants to use multiple netdev and rdma devices of a PCI device
> without using PCI SR-IOV. This is done by creating slices of a PCI device using
> devlink tool.
> 
> Requirements:
> -------------
> 1. Able to make multiple slices of a device 2. Able to provision such slice to
> use in-kernel 3. Able to have persistent name of netdev and rdma device of a
> slice 4. Reuse current power management (suspend/resume) kernel infra for
>    slice devices; not reinvent locking etc in each driver or devlink 5. Able to
> control, steer, offload network traffic of slices using
>    existing devlink eswitch switchdev support 6. Suddenly not change existing
> pci pf/vf naming scheme by introducing
>    slice
> 7. Support slice of a device in an iommu enabled kernel 8. Dynamically
> create/delete a slice 9. Ability to configure a slice before deploying it 10.
> Get/Set slice attributes like PCI VF slice attributes 11. Reuse/extend virtbus
> for carrying SF slice devices 12. Hot-plug a slice device in host system from
> the NIC
>     (eswitch system) side without running any agent in the
>     host system
> 13. Have unified interface for slice management regardless of
>     deploying on eswitch system or on host system.
> 14. User must be able to create a portion of the device from eswitch
>     system and attach to the untrusted host system. This host system
>     is not accessible to eswitch system for purpose of device
>     life-cycle and initial configuration.
> 
> Slice:
> ------
> A slice represents a portion of the device. A slice is a generic object that
> represents either a PCI VF or PCI PF or PCI sub function (SF) described below.
> 
> Sub-function (SF):
> ------------------
> - An sub-function is a portion of the PCI device which supports multiple
>   class of devices such as netdev, rdma and more.
> - An SF netdev has its own dedicated queues(txq, rxq).
> - An SF rdma device has its own QP1, GID table and rdma resources.
>   An SF rdma resources has its own resource namespace.
> - An SF supports eswitch representation and full offload support
>   when it is running with eswitch support.
> - User must configure eswitch to send/receive packets for an SF.
> - An SF shares PCI level resources with other SFs and/or with its
>   parent PCI function.
>   For example, an SF shares IRQ vectors with other SFs and its
>   PCI function.
>   In future it may have dedicated IRQ vector per SF.
>   An SF has dedicated window in PCI BAR space that is not shared
>   with other SFs or PF. This ensures that when a SF is assigned to
>   an application, only that application can access device resources.
> 
> Overall design:
> ---------------
> A new flavour of slice is created that represents a portion of the device. It is
> equivalent to a PCI VF slice, but new slice exists without PCI SR-IOV. This can
> scale to possibly more than number of SR-IOV VFs. A new slice flavour 'pcisf'
> is introduced which is explained later in this document.
> 
> devlink subsystem is extended to create, delete and deploy a slice using a
> devlink instance.
> 
> Slice life cycle is done using devlink commands explained below.
> 
> (a) Lifecycle command consist of 3 main commands i.e. add, delete and
>     state change.
> (b) Add/delete commands create or delete the slice of a device
>     respectively.
> (c) A slice undergoes one or more configuration before it is
>     activated. This may include,
>      (a) slice hardware address configuration
>      (b) representor netdevice configuration
>      (c) Network policy, steering configuration through representor
> (d) Once a slice is fully configured, user activates it.
>     Slice activation triggers device enumeration and binding to the
>     driver.
> (e) Each slice goes through a state transition during its life cycle.
> (f) Each slice's admin state is controlled by the user. Slice
>     operational is updated based on driver attach/detach tasks on the
>     host system.
> 
>     such as,
>     admin state            description
>     -----------            ------------
>     1. inactive    Slice enter this state when it is newly created.
>                    User typically does most of the configuration in
>                    this state before activating the slice.
> 
>     2. active      State when slice is just activated by user.
> 
>     operational            description
>        state
>     -----------            ------------
>     1. attached    State when slice device is bound to the host
>                    driver. When the slice device is unbound from the
>                    host driver, slice device exits this state and
>                    enters detaching state.
> 
>     2. detaching   State when host is notified to deactivate the
>                    slice device and slice device may be undergoing
>                    detachment from host driver. When slice device is
>                    fully detached from the host driver, slice exits
>                    this state and enters detached state.
> 
>     3. detached    State when driver is fully unbound, it enters
>                    into detached state.
> 
> slice state machine:
> --------------------
>                                slice state set inactive
>                               ----<------------------<---
>                               | or  slice delete        |
>                               |                         |
>   __________              ____|_______              ____|_______
>  |          | slice add  |            |slice state |            |
>  |          |-------->---|            |------>-----|            |
>  | invalid  |            |  inactive  | set active |   active   |
>  |          | slice del  |            |            |            |
>  |__________|--<---------|____________|            |____________|
> 
> slice device operational state machine:
> ---------------------------------------
>   __________                ____________              ___________
>  |          | slice state  |            |driver bus  |           |
>  | invalid  |-------->-----|  detached  |------>-----| attached  |
>  |          | set active   |            | probe()    |           |
>  |__________|              |____________|            |___________|
>                                  |                        |
>                                  ^                    slice set
>                                  |                    set inactive
>                             successful detach             |
>                               or pf reset                 |
>                              ____|_______                 |
>                             |            | driver bus     |
>                  -----------| detaching  |---<-------------
>                  |          |            | remove()
>                  ^          |____________|
>                  |   timeout      |
>                  --<---------------
> 
> More on the slice states:
> (a) Primary reason to run slice life cycle this way is: a user must
>     be able to create and configure a slice on eswitch system, before
>     host system discovers the slice device. Hence, a slice
>     device will not be accessible from the host system until it is
>     explicitly activated by the user. Typically this is done after
>     all necessary slice attributes are configured.
> (b) A user wants to create and configure the slice on eswitch system
>     when host system is power-down state.
> (c) A slice interface for sub-functions should be uniform regardless
>     of slice devices are deployed in eswitch system or host system.
> (d) If a host system software where slice is deployed is compromised,
>     which doesn't detach the slice, such slice remains in detaching
>     state until PF driver is reset on the host. Such slice won't be
>     usable until it is detached gracefully by host software.
>     Forcefully changing its state and reusing it can lead to
>     unexpected behavior and access of slice resources. Hence when
>     opstate = detaching and state = inactive, such slice cannot be
>     activated.
> 
> SF sysfs, virtbus and devlink plumbing:
> ---------------------------------------
> (a) Given that a sub-function device may have PCI BAR resource(s), power
>     management, IRQ configuration, persistence naming etc, a clear sysfs
>     view like existing PCI device is desired.
> 
>     Therefore, an SF resides on a new bus called virtbus.
>     This virtbus holds one or more SFs of a parent PCI device.
> 
>     Whenever user activates a slice, corresponding sub-function device
>     is created on the virtbus and attached to the driver.
> 
> (b) Each SF has user defined unique number associated with it,
>     called 'sfnum'. This sfnum is provided during SF slice creation
>     time. Multiple uses of the sfnum is explained in detail below.
> (c) An active SF slice has a unique 'struct device' anchored on
>     virtbus. An SF is identified using unique name on the virtbus.
> (d) An SF's device (sysfs) name is created using ida assigned by the
>     virtbus core. such as,
>     /sys/bus/virtbus/devices/mlx5_sf.100
>     /sys/bus/virtbus/devices/mlx5_sf.2
> (e) Scope of a sfnum is within the devlink instance who supports SF
>     lifecycle and SF devlink port lifecycle.
>     This sfnum is populated in /sys/bus/virtbus/devices/100/sfnum
> (f) Persistent name of Netdevice and RDMA device of a virtbus SF
>     device is prepared using parent device of SF and sfnum of an SF.
>     Such as,
>     /sys/bus/virtbus/devices/mlx5_sf.100 ->
> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/100
>     Netdev name for such SF = enp6s0f0s1
>     Where <en><parent_dev naming scheme> <s for sf> <sfnum_in_decimal>
>     This generates unique netdev name because parent is involved in the
>     device naming.
>     RDMA device persistent name will be done similarly such as
>     'rocep6s0f0s1'.
> (g) SF devlink instance name is prepared using SF's parent bus/device
>     and sfnum. Such as, pci/0000.06.00.0%sf1.
>     This scheme ensures that SF's devlink device name remains
>     predictable using sfnum regardless of dynamic virtbus device
>     name/index.
> (h) Each virtbus SF device driver id is defined by virtbus core.
>     This driver id is used to bind virtbus SF device with the SF
>     driver which has matching driver-id provided during driver
>     registration time.
>     This is further visible via modpost tools too.
> (i) A new devlink eswitch port flavour named 'pcisf' is introduced for
>     PCI SF devices similar to existing flavour 'pcivf' for PCI VFs.
> (j) devlink eswitch port for PCI SF is added/deleted whenever SF slice
>     is added/deleted.
> (k) SF representor netdev phys_port_name=pf0sf1
>     Format is: pf<pf_number>sf<user_assigned_sfnum>
> 
> Examples:
> ---------
> - Add a slice of flavour SF with sf identifier 4 on eswitch system:
> 
> $ devlink slice add pci/0000.06.00.0 flavour pcisf pfnum 0 sfnum 4 index 100
> 
> - Show a newly added slice on eswitch system:
> $ devlink slice show pci/0000:06:00.0/100 -jp {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state" : "inactive",
>             "opstate" : "detached",
>         }
>     }
>   }
> 
> - Show eswitch port configuration for SF on eswitch system:
> $ devlink port show
> pci/0000:06:00.0/65535: type eth netdev ens2f0 flavour physical port 0
> pci/0000:06:00.0/8000: type eth netdev ens2f0sf1 flavour pcisf pfnum 0
> sfnum 4 Here eswitch port index 8000 is assigned by the vendor driver.
> 
> - Activate SF slice to trigger creating a virtbus SF device in host system:
> $ devlink slice set pci/0000.06.00.0/100 state active
> 
> - Query SF slice state on eswitch system:
> $ devlink slice show pci/0000:06:00.0/100 -jp {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "active",
>             "opstate": "detached",
>         }
>     }
>   }
> 
> - Query SF slice state on eswitch system once driver is loaded on SF:
> $ devlink slice show pci/0000:06:00.0/100 -jp {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "active",
>             "opstate": "attached",
>         }
>     }
>   }
> 
> - Show devlink instances on host system:
> $ devlink dev show
> pci/0000:06:00.0
> pci/0000:06:00.0%sf4
> 
> - Show devlink ports on host system:
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical type eth netdev netdev enp6s0f0
> pci/0000:06:00.0%sf4/0: flavour virtual type eth netdev netdev enp6s0f0s10
> 
> - Mark a slice inactive, when slice may be in active state:
> $ devlink slice set pci/0000.06.00.0/100 state inactive
> 
> - Query SF slice state on eswitch system once inactivation is triggered:
> Output when it is in detaching state:
> 
> $ devlink slice show pci/0000:06:00.0/100 -jp {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "inactive",
>             "opstate": "detaching",
>         }
>     }
>   }
> 
> Output once detached:
> $ devlink slice show pci/0000:06:00.0/100 -jp {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "active": "inactive",
>             "opstate": "detached",
>         }
>     }
>   }
> 
> - Delete the slice which is recently inactivated.
> $ devlink slice del pci/0000.06.00.0/100
> 
> FAQs:
> ----
> 1. What are the differences between PCI VF slice and PCI SF slice?
> Ans:
> (a) PCI SF slice lifecycle is driven by the user at individual slice
>     level.
>     PCI VF slice(s) are created and destroyed by the vendor driver
>     such as mlx5_core. Typically, this is done when SR-IOV is
>     enabled/disabled.
> (b) PCI VF slices state cannot be changed by user as it is currently
>     not support by PCI core and vendor devices. They are always
>     created in active state.
>     PCI SF slice state is controlled by the user.
> 
> 2. What are the similarities between PCI VF and PCI SF slice?
> (a) Both slices have similar config attributes.
> (b) Both slices have eswitch devlink port and representor netdevice.
> 
> 3. What about handling VF slice similar to SF slice?
> Ans:
>    Yes this is desired. When VF vendor device supports this mode,
>    There will be at least one difference between SF and VF
>    slice handling, i.e. SRIOV enablement will continue using sysfs on
>    host system. Once SRIOV is enabled, VF slice commands should
>    function in similar way as SF slice.
> 
> 4. What are the similaries with SF slice and PF slice?
> Ans:
> (a) Both slices have similar config attributes.
> (b) Both slices have eswitch devlink port and representor netdevice.
> 
> 5. Can slice be used to have dynamic PF as slice?
> Ans:
>    Yes. Whenever a vendor device support dynamic PF, same lifecycle,
>    APIs, attributes can be used for PF slice.
> 
> 6. Can slice exist without a eswitch?
> Ans: Yes
> 
> 7. Why not overload devlink port instead of creating new object slice?
> Ans:
>    In one implementation a devlink port can be overloaded/retrofit
>    what slice object wants to achieve.
>    Same reasoning can be applied to overload and retrofit netdevice to
>    achieve what a slice object wants to achieve.
>    However, it is more natural to create a new object (slice) that
>    represents a device for below rationale.
>    (a) Even though a devlink slice has devlink port attached to it,
>        it is narrow model to always have such association. It limits
>        the devlink to be used only with eswitch.
>    (b) slice undergoes state transitions from
>        create->config->activate->inactivate->delete.
>        It is weird to have few port flavours follow state transition
>        and few don't.
> 
> 8. Why a bus is needed?
> Ans:
> (a) To get unique, persistent and deterministic names of netdev,
>     rdma dev of slice/sf.
> (b) device lifecycle operates using similar pci bus and current
>     driver model. No need to invent new lifecycle scheme.
> (c) To follow uniform device config model for VF and SFs, where
>     user must be able to configure the attributes before binding a
>     VF/SF to driver.
> (d) In future, if needed a virtbus SF device can be bound to some
>     other driver similar to how a PCI PF/VF device can be bound to
>     mlx5_core or vfio_pci device.
> (e) Reuse kernel's existing power management framework to
>     suspend/resume SF devices.
> (f) When using SFs in smartnic based system where SF eswitch port and
>     SF slice are located on two different systems, user desire to
>     configure eswitch before activating the SF slice device.
> Bus allows to follow existing driver model for above needs as,
> create->configure_multiple_attributes->deploy.
> 
> 9. Why not further extend (or abuse) mdev bus?
> Ans: Because
> (a) mdev and vfio are coupled together from iommu perspective
> (b) mdev shouldn't be further abuse to use mdev in current state
>     as suggested by Greg-KH, Jason Gunthrope.
> (c) One bus for "sw_api" purpose, hence the new virtbus.
> (d) Few don't like weird guid/uuid of mdev for sub function purpose
> (e) If needed to map a SF to VM using mdev, new mdev slice flavour
>     should be created for lifecycling via devlink or may be continue
>     life cycle via mdev sysfs; but do remaining slice handling via
>     devlink similar to how PCI VF slices are managed.
> 
> 10. Is it ok to reuse virtbus used for matching service?
> Ans: Yes.
> Greg-KH guided in [1] that its ok for virtbus to hold devices with different
> attributes. Such as matching services vs SF devices where SF device will have
> (a) PCI BAR resource info
> (b) IRQ info
> (c) sfnum
> 
> Greg-KH also guided in [1] that its ok to anchor netdev and rdma dev of a SF
> device to SF virtbus device while rdma and netdev of matching service
> virtbus device to anchored at the parent PCI device.
> 
> [1] https://www.spinics.net/lists/linux-rdma/msg87124.html
> 
> 11. Why platform and mfd framework is not used for creating slices?
> Ans:
> (a) As platform documentation clearly describes, platform devices
>     are for a specific platform. They are autonomous devices, unlike
>     user created slices.
> (b) slices are dynamic in nature where each individual slice is
>     created/destroyed independently.
> (c) MFD (multi-function) devices are for a device that comprise
>     more than one non-unique yet varying hardware functionality.
>     While, here each slice is of same type as that of parent device,
>     with less capabilities than the parent device. Given that mfd
>     devices are built on top of platform devices, it inherits similar
>     limitations as that of platform device.
> (d) In few other threads Greg-KH said to not (ab)use platform
>     devices for such purpose.
> (e) Given that for networking related slices, slice is linked to
>     eswitch which is managed by devlink, having lifecycle through
>     devlink, improves locking/synchronization to single subsystem.
> 
> 12. A given slice supports multiple class of devices, such as RDMA,
>    netdev, vdpa device. How can I disable/enable such attributes of
>    a slice?
> Ans:
>    Slice attributes should be extended in vendor neutral and also
>    vendor specific way to enable/disable such attributes depending
>    on attribute type. This should be handled in future patches.
> 
> virtbus limitations:
> --------------------
> 1. virtbus bus will not support IOMMU like how pci bus does.
>    Hence, all protocol devices (such as netdev, rdmadev) must use its
>    parent's DMA device.
>    RDMA subsystem will use ib_device->dma_device with existing ib_dma_
>    wrapper.
>    netdev will use mlx5_core_dev->pci_dev for dma purposes.
>    This is suggested/hinted by Christoph and Jason.
> 
>    Why is it this way?
>    (a) Because currently only rdma and netdev intend to use it.
>    (b) In future if more use case arise, virtbus device can share DMAR
>        and group of same parent PCI device for in-kernel usecase.
> 
> 2. Dedicated or shared irq vector(s) assignment per sub function and
>    its exposure in sysfs will not be supported in initial version.
>    It will be supported in future series.
> 
> 3. PCI BAR resource information as resource files in sysfs will not be
>    supported in initial version.
>    It will be supported in future series.
> 
> Post initial series:
> --------------------
> Following plumbing will be done post initial series.
> 
> 1. Max number of SF slices resource at devlink level 2. Max IRQ vectors
> resource config at devlink level 3. sf lifecycle for kernel vDPA support, still
> under discussion 4. systemd/udev user space patches for persistent naming
> for
>    netdev and rdma
> 
> Out-of-scope:
> -------------
> 1. Creating mdev/vhost backend/vdpa devices using devlink slice APIs.
>    To support vhost backend 'struct device' creation, SF slice
>    create/set API can be extended to enable/disable specific
>    capabilities of the slice. Such as enable/disable kernel vDPA
>    feature or enable/disable RDMA device for a SF slice.
> 2. Similar extension is applicable for a VF slice.
> 3. Multi host support is orthogonal to this and it can be extended
>    in future.
> 
> Example software/system view:
> -----------------------------
>        _______
>       | admin |
>       | user  |----------
>       |_______|         |
>           |             |
>       ____|____       __|______            _____________
>      |         |     |         |          |             |
>      | devlink |     |   ovs   |          |    user     |
>      | tool    |     |_________|          | application |
>      |_________|         |                |_____________|
>            |             |                   |       |
> -----------|-------------|-------------------|-------|-----------
>            |             |           +----------+   +----------+
>            |             |           |  netdev  |   | rdma dev |
>            |             |           +----------+   +----------+
>       (slice cmds,       |              ^             ^
>        add/del/set)      |              |             |
>            |             |              +-------------|
>       _____|___          |              |         ____|________
>      |         |         |              |        |             |
>      | devlink |   +------------+       |        | mlx5_core/ib|
>      | kernel  |   | rep netdev |       |        | drivers     |
>      |_________|   +------------+       |        |_____________|
>            |             |              |             ^
>      (slice cmds)        |              |        (probe/remove)
>       _____|____         |              |         ____|_____
>      |          |        |    +--------------+   |          |
>      | mlx5_core|---------    | virtbus dev  |---|  virtbus |
>      | driver   |             +--------------+   |  driver  |
>      |__________|                                |__________|
>            |                                          ^
>       (sf add/del, events)                            |
>            |                                   (device add/del)
>       _____|____                                  ____|_____
>      |          |                                |          |
>      |  PCI NIC |---- admin activate/deactive    | mlx5_core|
>      |__________|           deactive events ---->| driver   |
>                                                  |__________|
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] current devlink extension plan for NICs
  2020-04-08  5:07                                 ` Parav Pandit
@ 2020-04-08 16:59                                   ` Jakub Kicinski
  2020-04-08 18:13                                     ` Parav Pandit
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-04-08 16:59 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

On Wed, 8 Apr 2020 05:07:04 +0000 Parav Pandit wrote:
> > > > > 3. In future at eswitch pci port, I will be adding dpipe support
> > > > > for the internal flow tables done by the driver.
> > > > > 4. There were inconsistency among vendor drivers in using/abusing
> > > > > phys_port_name of the eswitch ports. This is consolidated via
> > > > > devlink port in core. This provides consistent view among all
> > > > > vendor drivers.
> > > > >
> > > > > So PCI eswitch side ports are useful regardless of slice.
> > > > >  
> > > > > >> Additionally devlink port object doesn't go through the same
> > > > > >> state machine as that what slice has to go through.
> > > > > >> So its weird that some devlink port has state machine and some
> > > > > >> doesn't.  
> > > > > >
> > > > > > You mean for VFs? I think you can add the states to the API.
> > > > > >  
> > > > > As we agreed above that eswitch side objects (devlink port and
> > > > > representor netdev) should not be used for 'portion of device',  
> > > >
> > > > We haven't agreed, I just explained how we differ.  
> > >
> > > You mentioned that " Right, in my mental model representor _is_ a port
> > > of the eswitch, so repr would not make sense to me."
> > >
> > > With that I infer that 'any object that is directly and _always_
> > > linked to eswitch and represents an eswitch port is out of question,
> > > this includes devlink port of eswitch and netdev representor. Hence,
> > > the comment 'we agree conceptually' to not involve devlink port of
> > > eswitch and representor netdev to represent 'portion of the device'.  
> > 
> > I disagree, repr is one to one with eswitch port. Just because repr is
> > associated with a devlink port doesn't mean devlink port must be associated
> > with a repr or a netdev.  
> Devlink port which is on eswitch side is registered with switch_id and also linked to the rep netdev.
> From this port phys_port_name is derived.
> This eswitch port shouldn't represent 'portion of the device'.

switch_id is per port, so it's perfectly fine for a devlink port not to
have one, or for two ports of the same device to have a different ID.

The phys_port_name argument I don't follow. How does that matter in the
"should we create another object" debate?

IMO introducing the slice if it's 1:1 with ports is a no-go. I also
don't like how creating a slice implicitly creates a devlink port in
your design. If those objects are so strongly linked that creating one
implies the other they should just be merged.

I'm also concerned that the slice is basically a non-networking port.
I bet some of the things we add there will one day be useful for
networking or DSA ports.

So I'd suggest to maybe step back from the SmartNIC scenario and try to
figure out how slices are useful on their own.


* RE: [RFC] current devlink extension plan for NICs
  2020-04-08 16:59                                   ` Jakub Kicinski
@ 2020-04-08 18:13                                     ` Parav Pandit
  2020-04-09  2:07                                       ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Parav Pandit @ 2020-04-08 18:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta, magnus.karlsson

> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Jakub Kicinski
> 
> On Wed, 8 Apr 2020 05:07:04 +0000 Parav Pandit wrote:
> > > > > > 3. In future at eswitch pci port, I will be adding dpipe
> > > > > > support for the internal flow tables done by the driver.
> > > > > > 4. There were inconsistency among vendor drivers in
> > > > > > using/abusing phys_port_name of the eswitch ports. This is
> > > > > > consolidated via devlink port in core. This provides
> > > > > > consistent view among all vendor drivers.
> > > > > >
> > > > > > So PCI eswitch side ports are useful regardless of slice.
> > > > > >
> > > > > > >> Additionally devlink port object doesn't go through the
> > > > > > >> same state machine as that what slice has to go through.
> > > > > > >> So its weird that some devlink port has state machine and
> > > > > > >> some doesn't.
> > > > > > >
> > > > > > > You mean for VFs? I think you can add the states to the API.
> > > > > > >
> > > > > > As we agreed above that eswitch side objects (devlink port and
> > > > > > representor netdev) should not be used for 'portion of
> > > > > > device',
> > > > >
> > > > > We haven't agreed, I just explained how we differ.
> > > >
> > > > You mentioned that " Right, in my mental model representor _is_ a
> > > > port of the eswitch, so repr would not make sense to me."
> > > >
> > > > With that I infer that 'any object that is directly and _always_
> > > > linked to eswitch and represents an eswitch port is out of
> > > > question, this includes devlink port of eswitch and netdev
> > > > representor. Hence, the comment 'we agree conceptually' to not
> > > > involve devlink port of eswitch and representor netdev to represent
> 'portion of the device'.
> > >
> > > I disagree, repr is one to one with eswitch port. Just because repr
> > > is associated with a devlink port doesn't mean devlink port must be
> > > associated with a repr or a netdev.
> > Devlink port which is on eswitch side is registered with switch_id and also
> linked to the rep netdev.
> > From this port phys_port_name is derived.
> > This eswitch port shouldn't represent 'portion of the device'.
> 
> switch_id is per port, so it's perfectly fine for a devlink port not to have one, or
> for two ports of the same device to have a different ID.
> 
> The phys_port_name argument I don't follow. How does that matter in the
> "should we create another object" debate?
> 
It's very clear in the net/core/devlink.c code that a devlink port with a switch_id belongs to the switch side and is linked to the eswitch representor netdev.
It just cannot/should not be overloaded to carry host-side attributes.

> IMO introducing the slice if it's 1:1 with ports is a no-go. 
I disagree.
With that argument, the devlink port for the eswitch should not have existed either, and the netdev should have been self-describing.
But it is not done that way, for the three reasons I already described in this thread.
Please get rid of the devlink eswitch port and put all of it in the representor netdev; after that, the 1:1 no-go point makes sense. :-)

Also, we already discussed that it's not 1:1. A slice might not have a devlink port.
We don't want to start with the lowest common denominator and a narrow use case.

I also described to you that a slice runs through a state machine which a devlink port doesn't.
We don't want to overload the devlink port object.

> I also don't like how
> creating a slice implicitly creates a devlink port in your design. If those objects
> are so strongly linked that creating one implies the other they should just be
> merged.
I disagree.
When a netdev representor is created, its multiple health reporters (strongly linked) are created implicitly.
We didn't merge them, and the user didn't explicitly create them, for good reasons.

A slice, as described, represents a 'portion of a device'. As described in the RFC, it's the master object for which the other associated sub-objects get created,
like an optional devlink port, representor, health reporters, and resources.
Again, it is not 1:1.

As Jiri described, and you acked, a devlink slice does not have to have a devlink port.

There are enough examples in the devlink subsystem today where objects are related both 1:1 and non-1:1.
Shared buffers, devlink ports, health reporters, and representors all have such mappings with each other.
> 
> I'm also concerned that the slice is basically a non-networking port.
What is the concern?
How are the shared buffer and the health reporter attributed as networking objects?

> I bet some of the things we add there will one day be useful for networking or
> DSA ports.
> 
I think this is a misinterpretation of the devlink slice object.
All the things we intend to do with a devlink slice are useful for both networking and non-networking use.
So saying 'a devlink slice is a non-networking port, hence it cannot be used for networking' is a wrong interpretation.

I do not understand DSA ports much, but what blocks users from using a slice if it fits the need in the future?

How are the shared buffer and the health reporter 'networking' objects, which exist under devlink but not strictly under a devlink port?

> So I'd suggest to maybe step back from the SmartNIC scenario and try to figure
> out how slices are useful on their own.
I already went through the requirements, scenarios, examples and usage model in the RFC extension, which describes
(a) how a slice fits both the smartnic and non-smartnic cases, and
(b) how the user gets the same experience and commands regardless of the use case.

A 'good' in-kernel example where one object is overloaded to do multiple things would be needed to support the idea of overloading the devlink port.
For example, merging the macvlan and vlan driver objects to provide both functionalities, or overloading a recently introduced qdisc to do multiple things.


* Re: [RFC] current devlink extension plan for NICs
  2020-04-08 18:13                                     ` Parav Pandit
@ 2020-04-09  2:07                                       ` Jakub Kicinski
  2020-04-09  6:43                                         ` Parav Pandit
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-04-09  2:07 UTC (permalink / raw)
  To: Parav Pandit, magnus.karlsson
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta

On Wed, 8 Apr 2020 18:13:50 +0000 Parav Pandit wrote:
> > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> > Behalf Of Jakub Kicinski
> > 
> > On Wed, 8 Apr 2020 05:07:04 +0000 Parav Pandit wrote:  
> > > > > > > 3. In future at eswitch pci port, I will be adding dpipe
> > > > > > > support for the internal flow tables done by the driver.
> > > > > > > 4. There were inconsistency among vendor drivers in
> > > > > > > using/abusing phys_port_name of the eswitch ports. This is
> > > > > > > consolidated via devlink port in core. This provides
> > > > > > > consistent view among all vendor drivers.
> > > > > > >
> > > > > > > So PCI eswitch side ports are useful regardless of slice.
> > > > > > >  
> > > > > > > >> Additionally devlink port object doesn't go through the
> > > > > > > >> same state machine as that what slice has to go through.
> > > > > > > >> So its weird that some devlink port has state machine and
> > > > > > > >> some doesn't.  
> > > > > > > >
> > > > > > > > You mean for VFs? I think you can add the states to the API.
> > > > > > > >  
> > > > > > > As we agreed above that eswitch side objects (devlink port and
> > > > > > > representor netdev) should not be used for 'portion of
> > > > > > > device',  
> > > > > >
> > > > > > We haven't agreed, I just explained how we differ.  
> > > > >
> > > > > You mentioned that " Right, in my mental model representor _is_ a
> > > > > port of the eswitch, so repr would not make sense to me."
> > > > >
> > > > > With that I infer that 'any object that is directly and _always_
> > > > > linked to eswitch and represents an eswitch port is out of
> > > > > question, this includes devlink port of eswitch and netdev
> > > > > representor. Hence, the comment 'we agree conceptually' to not
> > > > > involve devlink port of eswitch and representor netdev to represent  
> > 'portion of the device'.  
> > > >
> > > > I disagree, repr is one to one with eswitch port. Just because repr
> > > > is associated with a devlink port doesn't mean devlink port must be
> > > > associated with a repr or a netdev.  
> > > Devlink port which is on eswitch side is registered with switch_id and also  
> > linked to the rep netdev.  
> > > From this port phys_port_name is derived.
> > > This eswitch port shouldn't represent 'portion of the device'.  
> > 
> > switch_id is per port, so it's perfectly fine for a devlink port not to have one, or
> > for two ports of the same device to have a different ID.
> > 
> > The phys_port_name argument I don't follow. How does that matter in the
> > "should we create another object" debate?
> >   
> Its very clear in net/core/devlink.c code that a devlink port with a
> switch_id belongs to switch side and linked to eswitch representor
> netdev.
> 
> It just cannot/should not be overloaded to drive host side attributes.
>
> > IMO introducing the slice if it's 1:1 with ports is a no-go.   
> I disagree.
> With that argument devlink port for eswitch should not have existed and netdev should have been self-describing.
> But it is not done that way for 3 reasons I described already in this thread.
> Please get rid of devlink eswitch port and put all of it in representor netdev, after that 1:1 no-go point make sense. :-)
> 
> Also we already discussed that its not 1:1. A slice might not have devlink port.
> We don't want to start with lowest denominator and narrow use case.
> 
> I also described you that slice runs through state machine which devlink port doesn't.
> We don't want to overload devlink port object.
> 
> > I also don't like how
> > creating a slice implicitly creates a devlink port in your design. If those objects
> > are so strongly linked that creating one implies the other they should just be
> > merged.  
> I disagree.
> When netdev representor is created, its multiple health reporters (strongly linked) are created implicitly.
> We didn't merge and user didn't explicitly created them for right reasons.
> 
> A slice as described represents 'portion of a device'. As described in RFC, it's the master object for which other associated sub-objects gets created.
> Like an optional devlink port, representor, health reporters, resources.
> Again, it is not 1:1.
> 
> As Jiri described and you acked that devlink slice need not have to have a devlink port.
> 
> There are enough examples in devlink subsystem today where 1:1 and non 1:1 objects can be related.
> Shared buffers, devlink ports, health reporters, representors have such mapping with each other.

I'm not going to respond to any of that. We're going in circles.

I bet you remember the history of PCI ports, and my initial patch set.

We even had a call about this. Clearly all of it was a waste of time.

> > I'm also concerned that the slice is basically a non-networking port.  
> What is the concern?

What I wrote below, but you decided to split off in your reply for
whatever reason.

> How is shared-buffer, health reporter is attributed as networking object?

By non-networking I mean non-ethernet, or host facing. Which should be
clear from what I wrote below.

> > I bet some of the things we add there will one day be useful for networking or
> > DSA ports.
> >   
> I think this is mis-interpretation of a devlink slice object.
> All things we intent to do in devlink slice is useful for networking and non-networking use.
> So saying 'devlink slice is non networking port, hence it cannot be used for networking' -> is a wrong interpretation.
> 
> I do not understand DSA port much, but what blocks users to use slice if it fits the need in future.
> 
> How is shared buffer, health reporter are 'networking' object which exists under devlink, but not strictly under devlink port?

E.g. you add rate limiting on the slice. That's something that may be
useful for other ingress points of the device. But it's added to the
slice, not the port, so we can't reuse the API for network ports.

> > So I'd suggest to maybe step back from the SmartNIC scenario and try to figure
> > out how slices are useful on their own.  
> I already went through the requirements, scenario, examples and use model in the RFC extension that describes 
> (a) how slice fits smartnic and non smartnic both cases.
> (b) how user gets same experience and commands regardless of use cases.
> 
> A 'good' in-kernel example where one object is overloaded to do multiple things would support a thought to overload devlink port.
> For example merge macvlan and vlan driver object to do both functionalities.
> An overloaded recently introduced qdisc to multiple things as another.


* RE: [RFC] current devlink extension plan for NICs
  2020-04-09  2:07                                       ` Jakub Kicinski
@ 2020-04-09  6:43                                         ` Parav Pandit
  0 siblings, 0 replies; 50+ messages in thread
From: Parav Pandit @ 2020-04-09  6:43 UTC (permalink / raw)
  To: Jakub Kicinski, magnus.karlsson
  Cc: Jiri Pirko, netdev, davem, Yuval Avnery, jgg, Saeed Mahameed,
	leon, andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, jacob.e.keller, Alex Vesker,
	linyunsheng, lihong.yang, vikas.gupta



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, April 9, 2020 7:37 AM
> 
> On Wed, 8 Apr 2020 18:13:50 +0000 Parav Pandit wrote:
> > > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> On
> > > Behalf Of Jakub Kicinski
> > >
> > > On Wed, 8 Apr 2020 05:07:04 +0000 Parav Pandit wrote:
> > > > > > > > 3. In future at eswitch pci port, I will be adding dpipe
> > > > > > > > support for the internal flow tables done by the driver.
> > > > > > > > 4. There were inconsistency among vendor drivers in
> > > > > > > > using/abusing phys_port_name of the eswitch ports. This is
> > > > > > > > consolidated via devlink port in core. This provides
> > > > > > > > consistent view among all vendor drivers.
> > > > > > > >
> > > > > > > > So PCI eswitch side ports are useful regardless of slice.
> > > > > > > >
> > > > > > > > >> Additionally devlink port object doesn't go through the
> > > > > > > > >> same state machine as that what slice has to go through.
> > > > > > > > >> So its weird that some devlink port has state machine
> > > > > > > > >> and some doesn't.
> > > > > > > > >
> > > > > > > > > You mean for VFs? I think you can add the states to the API.
> > > > > > > > >
> > > > > > > > As we agreed above that eswitch side objects (devlink port
> > > > > > > > and representor netdev) should not be used for 'portion of
> > > > > > > > device',
> > > > > > >
> > > > > > > We haven't agreed, I just explained how we differ.
> > > > > >
> > > > > > You mentioned that " Right, in my mental model representor
> > > > > > _is_ a port of the eswitch, so repr would not make sense to me."
> > > > > >
> > > > > > With that I infer that 'any object that is directly and
> > > > > > _always_ linked to eswitch and represents an eswitch port is
> > > > > > out of question, this includes devlink port of eswitch and
> > > > > > netdev representor. Hence, the comment 'we agree conceptually'
> > > > > > to not involve devlink port of eswitch and representor netdev
> > > > > > to represent
> > > 'portion of the device'.
> > > > >
> > > > > I disagree, repr is one to one with eswitch port. Just because
> > > > > repr is associated with a devlink port doesn't mean devlink port
> > > > > must be associated with a repr or a netdev.
> > > > Devlink port which is on eswitch side is registered with switch_id
> > > > and also
> > > linked to the rep netdev.
> > > > From this port phys_port_name is derived.
> > > > This eswitch port shouldn't represent 'portion of the device'.
> > >
> > > switch_id is per port, so it's perfectly fine for a devlink port not
> > > to have one, or for two ports of the same device to have a different ID.
> > >
> > > The phys_port_name argument I don't follow. How does that matter in
> > > the "should we create another object" debate?
> > >
> > Its very clear in net/core/devlink.c code that a devlink port with a
> > switch_id belongs to switch side and linked to eswitch representor
> > netdev.
> >
> > It just cannot/should not be overloaded to drive host side attributes.
> >
> > > IMO introducing the slice if it's 1:1 with ports is a no-go.
> > I disagree.
> > With that argument devlink port for eswitch should not have existed and
> netdev should have been self-describing.
> > But it is not done that way for 3 reasons I described already in this thread.
> > Please get rid of devlink eswitch port and put all of it in
> > representor netdev, after that 1:1 no-go point make sense. :-)
> >
> > Also we already discussed that its not 1:1. A slice might not have devlink
> port.
> > We don't want to start with lowest denominator and narrow use case.
> >
> > I also described you that slice runs through state machine which devlink
> port doesn't.
> > We don't want to overload devlink port object.
> >
> > > I also don't like how
> > > creating a slice implicitly creates a devlink port in your design.
> > > If those objects are so strongly linked that creating one implies
> > > the other they should just be merged.
> > I disagree.
> > When netdev representor is created, its multiple health reporters (strongly
> linked) are created implicitly.
> > We didn't merge and user didn't explicitly created them for right reasons.
> >
> > A slice as described represents 'portion of a device'. As described in RFC,
> it's the master object for which other associated sub-objects gets created.
> > Like an optional devlink port, representor, health reporters, resources.
> > Again, it is not 1:1.
> >
> > As Jiri described and you acked that devlink slice need not have to have a
> devlink port.
> >
> > There are enough examples in devlink subsystem today where 1:1 and non
> 1:1 objects can be related.
> > Shared buffers, devlink ports, health reporters, representors have such
> mapping with each other.
> 
> I'm not going to respond to any of that. We're going in circles.
>
I don't think we are going in circles.
It's clear to me that some devlink objects are 1:1 mapped with other objects and some are not.
And 1:1 mapping (or the lack of it) is not a good enough reason to avoid a crisp definition of a host-facing 'portion of the device'.

So let's focus on the technical aspects below.
 
> I bet you remember the history of PCI ports, and my initial patch set.
> 
> We even had a call about this. Clearly all of it was a waste of time.
> 
> > > I'm also concerned that the slice is basically a non-networking port.
> > What is the concern?
> 
> What I wrote below, but you decided to split off in your reply for whatever
> reason.
> 
> > How is shared-buffer, health reporter is attributed as networking object?
> 
> By non-networking I mean non-ethernet, or host facing. Which should be
> clear from what I wrote below.
Ok. I agree, host facing is a good name.

> 
> > > I bet some of the things we add there will one day be useful for
> > > networking or DSA ports.
> > >
> > I think this is mis-interpretation of a devlink slice object.
> > All things we intent to do in devlink slice is useful for networking and non-
> networking use.
> > So saying 'devlink slice is non networking port, hence it cannot be used for
> networking' -> is a wrong interpretation.
> >
> > I do not understand DSA port much, but what blocks users to use slice if it
> fits the need in future.
> >
> > How is shared buffer, health reporter are 'networking' object which exists
> under devlink, but not strictly under devlink port?
> 
> E.g. you ad rate limiting on the slice. That's something that may be useful for
> other ingress points of the device. But it's added to the slice, not the port. So
> we can't reuse the API for network ports.
> 
To my knowledge, there is no user API in the upstream kernel for device (VF/PF) ingress rate limiting.
So if, in the future, a user wants to do ingress rate limiting, they would be better served by the rich eswitch and representor model with tc.

Device egress rate limiting at the slice level mainly comes from users who are migrating from 'ip link set vf rate',
and from grouping those VFs further, as Jiri described.

Maybe at this juncture we should migrate users to start using tc, e.g.:

tc filter add dev enp59s0f0_0 root protocol ip matchall action police rate 1mbit burst 20k

Is there already a good way to group a set of related netdevs and issue a tc filter to the grouped netdevs?
Or should we create one?
Is tc block sharing good for that purpose?

With that we have a cleaner slice interface, leaving the legacy behind.
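For illustration only (this is not part of the original RFC proposal), tc shared blocks can already attach one set of filters to several netdevs, which is roughly the grouping being asked about. A sketch, assuming hypothetical representor names and a NIC whose driver supports shared blocks:

```shell
#!/bin/sh
# Hypothetical VF representor names (enp59s0f0_0/_1); requires root and
# a driver that supports tc shared blocks (ingress_block, kernel >= 4.16).

# Bind the ingress of each representor to the same shared block 22.
tc qdisc add dev enp59s0f0_0 ingress_block 22 ingress
tc qdisc add dev enp59s0f0_1 ingress_block 22 ingress

# A single matchall police filter on the block then applies to the
# traffic of both representors at once.
tc filter add block 22 matchall action police rate 1mbit burst 20k
```

Whether one shared policer across the group or one policer instance per member is the desired semantic is exactly the design question raised above.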

Jiri, Jakub,
What is your input on grouping eswitch ports and configuring their ingress rate limiting, either as individual ports or as a group/block/team/something else?

> > > So I'd suggest to maybe step back from the SmartNIC scenario and try
> > > to figure out how slices are useful on their own.
> > I already went through the requirements, scenario, examples and use
> > model in the RFC extension that describes
> > (a) how slice fits smartnic and non smartnic both cases.
> > (b) how user gets same experience and commands regardless of use cases.
> >
> > A 'good' in-kernel example where one object is overloaded to do multiple
> things would support a thought to overload devlink port.
> > For example merge macvlan and vlan driver object to do both
> functionalities.
> > An overloaded recently introduced qdisc to multiple things as another.


end of thread, other threads:[~2020-04-09  6:43 UTC | newest]

Thread overview: 50+ messages
2020-03-19 19:27 [RFC] current devlink extension plan for NICs Jiri Pirko
2020-03-20  3:32 ` Jakub Kicinski
2020-03-20  7:35   ` Jiri Pirko
2020-03-20 21:25     ` Jakub Kicinski
2020-03-21  9:07       ` Parav Pandit
2020-03-23 19:31         ` Jakub Kicinski
2020-03-23 22:50           ` Jason Gunthorpe
2020-03-24  3:41             ` Jakub Kicinski
2020-03-24 13:43               ` Jason Gunthorpe
2020-03-24  5:36           ` Parav Pandit
2020-03-21  9:35       ` Jiri Pirko
2020-03-23 19:21         ` Jakub Kicinski
2020-03-23 22:06           ` Jason Gunthorpe
2020-03-24  3:56             ` Jakub Kicinski
2020-03-24 13:20               ` Jason Gunthorpe
2020-03-26 14:37           ` Jiri Pirko
2020-03-26 14:43           ` Jiri Pirko
2020-03-26 14:47           ` Jiri Pirko
2020-03-26 14:51             ` Jiri Pirko
2020-03-26 20:30               ` Jakub Kicinski
2020-03-27  7:47                 ` Jiri Pirko
2020-03-27 16:38                   ` Jakub Kicinski
2020-03-27 18:49                     ` Samudrala, Sridhar
2020-03-27 19:10                       ` Jakub Kicinski
2020-03-27 19:45                         ` Saeed Mahameed
2020-03-27 20:42                           ` Jakub Kicinski
2020-03-30  9:07                             ` Parav Pandit
2020-04-08  6:10                               ` Parav Pandit
2020-03-27 20:47                           ` Samudrala, Sridhar
2020-03-27 20:59                             ` Jakub Kicinski
2020-03-30  7:09                           ` Parav Pandit
2020-03-30  7:48                     ` Parav Pandit
2020-03-30 19:36                       ` Jakub Kicinski
2020-03-31  7:45                         ` Parav Pandit
2020-03-31 17:32                           ` Jakub Kicinski
2020-04-01  7:32                             ` Parav Pandit
2020-04-01 20:12                               ` Jakub Kicinski
2020-04-02  6:16                                 ` Jiri Pirko
2020-04-08  5:10                                   ` Parav Pandit
2020-04-08  5:07                                 ` Parav Pandit
2020-04-08 16:59                                   ` Jakub Kicinski
2020-04-08 18:13                                     ` Parav Pandit
2020-04-09  2:07                                       ` Jakub Kicinski
2020-04-09  6:43                                         ` Parav Pandit
2020-03-30  5:30                   ` Parav Pandit
2020-03-26 14:59           ` Jiri Pirko
2020-03-23 23:32         ` Andy Gospodarek
2020-03-24  0:11           ` Jason Gunthorpe
2020-03-24  5:53           ` Parav Pandit
2020-03-23 21:32       ` Andy Gospodarek
