netdev.vger.kernel.org archive mirror
* [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
@ 2022-08-05 16:58 ecree
  2022-08-05 19:15 ` Randy Dunlap
  2022-08-06  1:43 ` Jakub Kicinski
  0 siblings, 2 replies; 12+ messages in thread
From: ecree @ 2022-08-05 16:58 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, Edward Cree,
	linux-net-drivers

From: Edward Cree <ecree.xilinx@gmail.com>

There's no clear explanation of what VF Representors are for, their
 semantics, etc., outside of vendor docs and random conference slides.
Add a document explaining Representors and defining what drivers that
 implement them are expected to do.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
---
This documents representors as I understand them, but I suspect others
 (including other vendors) might disagree (particularly with the "what
 functions should have a rep" section).  I'm hoping that through review
 of this doc we can converge on a consensus.

 Documentation/networking/index.rst        |   1 +
 Documentation/networking/representors.rst | 219 ++++++++++++++++++++++
 Documentation/networking/switchdev.rst    |   1 +
 3 files changed, 221 insertions(+)
 create mode 100644 Documentation/networking/representors.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 03b215bddde8..c37ea2b54c29 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -93,6 +93,7 @@ Contents:
    radiotap-headers
    rds
    regulatory
+   representors
    rxrpc
    sctp
    secid
diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
new file mode 100644
index 000000000000..4d28731a5b5b
--- /dev/null
+++ b/Documentation/networking/representors.rst
@@ -0,0 +1,219 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Network Function Representors
+=============================
+
+This document describes the semantics and usage of representor netdevices, as
+used to control internal switching on SmartNICs.  For the closely-related port
+representors on physical (multi-port) switches, see
+:ref:`Documentation/networking/switchdev.rst <switchdev>`.
+
+Motivation
+----------
+
+Since the mid-2010s, network cards have started offering more complex
+virtualisation capabilities than the legacy SR-IOV approach (with its simple
+MAC/VLAN-based switching model) can support.  This led to a desire to offload
+software-defined networks (such as OpenVSwitch) to these NICs to specify the
+network connectivity of each function.  The resulting designs are variously
+called SmartNICs or DPUs.
+
+Network function representors provide the mechanism by which network functions
+on an internal switch are managed.  They are used both to configure the
+corresponding function ('representee') and to handle slow-path traffic to and
+from the representee for which no fast-path switching rule is matched.
+
+That is, a representor is both a control plane object (representing the function
+in administrative commands) and a data plane object (one end of a virtual pipe).
+As a virtual link endpoint, the representor can be configured like any other
+netdevice; in some cases (e.g. link state) the representee will follow the
+representor's configuration, while in others there are separate APIs to
+configure the representee.
+
+What does a representor do?
+---------------------------
+
+A representor has three main rôles.
+
+1. It is used to configure the representee's virtual MAC, e.g. link up/down,
+   MTU, etc.  For instance, bringing the representor administratively UP should
+   cause the representee to see a link up / carrier on event.
+2. It provides the slow path for traffic which does not hit any offloaded
+   fast-path rules in the virtual switch.  Packets transmitted on the
+   representor netdevice should be delivered to the representee; packets
+   transmitted to the representee which fail to match any switching rule should
+   be received on the representor netdevice.  (That is, there is a virtual pipe
+   connecting the representor to the representee, similar in concept to a veth
+   pair.)
+   This allows software switch implementations (such as OpenVSwitch or a Linux
+   bridge) to forward packets between representees and the rest of the network.
+3. It acts as a handle by which switching rules (such as TC filters) can refer
+   to the representee, allowing these rules to be offloaded.
+
+The combination of 2) and 3) means that the behaviour (apart from performance)
+should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
+on a VF representor applies in software to packets received on that representor
+netdevice, while in hardware offload it would apply to packets transmitted by
+the representee VF.  Conversely, a mirred egress redirect to a VF representor
+corresponds in hardware to delivery directly to the representee VF.
+
+What functions should have a representor?
+-----------------------------------------
+
+Essentially, for each virtual port on the device's internal switch, there
+should be a representor.
+The only exceptions are the management PF (whose port is used for traffic to
+and from all other representors) and perhaps the physical network port (for
+which the management PF may act as a kind of port representor).  Devices that
+combine multiple physical ports and SR-IOV capability may need to have port
+representors in addition to PF/VF representors.
+
+Thus, the following should all have representors:
+
+ - VFs belonging to the management PF.
+ - Other PFs on the PCIe controller, and any VFs belonging to them.
+ - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
+   System-on-Chip within the SmartNIC).
+ - PFs and VFs with other personalities, including network block devices (such
+   as a vDPA virtio-blk PF backed by remote/distributed storage).
+ - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
+   their own port on the switch (as opposed to using their parent PF's port).
+ - Any accelerators or plugins on the device whose interface to the network is
+   through a virtual switch port, even if they do not have a corresponding PCIe
+   PF or VF.
+
+This allows the entire switching behaviour of the NIC to be controlled through
+representor TC rules.
+
+An example of a PCIe function that should *not* have a representor is, on an
+FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
+and which cannot create RX and TX queues.  Since such a PF does not have network
+access through the internal switch, not even indirectly via a distributed
+storage endpoint, there is no switch virtual port for the representor to
+configure or to be the other end of the virtual pipe.
+
+How are representors created?
+-----------------------------
+
+The driver instance attached to the management PF should enumerate the virtual
+ports on the switch, and for each representee, create a pure-software netdevice
+which has some form of in-kernel reference to the PF's own netdevice or driver
+private data (``netdev_priv()``).
+If switch ports can dynamically appear/disappear, the PF driver should create
+and destroy representors appropriately.
+The operations of the representor netdevice will generally involve acting
+through the management PF.  For example, ``ndo_start_xmit()`` might send the
+packet, specially marked for delivery to the representee, through a TX queue
+attached to the management PF.
+
+How are representors identified?
+--------------------------------
+
+The representor netdevice should *not* directly refer to a PCIe device (e.g.
+through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
+representee or of the management PF.
+Instead, it should implement the ``ndo_get_port_parent_id()`` and
+``ndo_get_phys_port_name()`` netdevice ops (corresponding to the
+``phys_switch_id`` and ``phys_port_name`` sysfs nodes).
+``ndo_get_port_parent_id()`` should return a string identical to that returned
+by the management PF's ``ndo_get_phys_port_id()`` (typically the MAC address of
+the physical port), while ``ndo_get_phys_port_name()`` should return a string
+describing the representee's relation to the management PF.
+
+For instance, if the management PF has a ``phys_port_name`` of ``p0`` (physical
+port 0), then the representor for the third VF on the second PF should typically
+be ``p0pf1vf2`` (i.e. "port 0, PF 1, VF 2").  More generally, the
+``phys_port_name`` for a PCIe function should be the concatenation of one or
+more of:
+
+ - ``p<N>``, physical port number *N*.
+ - ``if<N>``, PCIe controller number *N*.  The semantics of these numbers are
+   vendor-defined, and controller 0 need not correspond to the controller on
+   which the management PF resides.
+ - ``pf<N>``, PCIe physical function index *N*.
+ - ``vf<N>``, PCIe virtual function index *N*.
+ - ``sf<N>``, Subfunction index *N*.
+
+It is expected that userland will use this information (e.g. through udev rules)
+to construct an appropriately informative name or alias for the netdevice.  For
+instance if the management PF is ``eth4`` then our representor with a
+``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
+
+There are as yet no established conventions for naming representors which do not
+correspond to PCIe functions (e.g. accelerators and plugins).
+
+How do representors interact with TC rules?
+-------------------------------------------
+
+Any TC rule on a representor applies (in software TC) to packets received by
+that representor netdevice.  Thus, if the delivery part of the rule corresponds
+to another port on the virtual switch, the driver may choose to offload it to
+hardware, applying it to packets transmitted by the representee.
+
+Similarly, since a TC mirred egress action targeting the representor would (in
+software) send the packet through the representor (and thus indirectly deliver
+it to the representee), hardware offload should interpret this as delivery to
+the representee.
+
+As a simple example, if ``eth0`` is the management PF's netdevice and ``eth1``
+is a VF representor, the following rules::
+
+    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
+        action mirred egress redirect dev eth0
+    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
+        action mirred egress mirror dev eth1
+
+would mean that all IPv4 packets from the VF are sent out the physical port, and
+all IPv4 packets received on the physical port are delivered to the VF in
+addition to the management PF.
+
+Of course the rules can (if supported by the NIC) include packet-modifying
+actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
+
+Tunnel encapsulation and decapsulation are rather more complicated, as they
+involve a third netdevice (a tunnel netdev operating in metadata mode, such as
+a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
+require an IP address to be bound to the underlay device (e.g. management PF or
+port representor).  TC rules such as::
+
+    tc filter add dev eth1 parent ffff: flower \
+        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
+                              dst_port 4789 \
+        action mirred egress redirect dev vxlan0
+    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
+        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
+        action tunnel_key unset action mirred egress redirect dev eth1
+
+where ``LOCAL_IP`` is an IP address bound to ``eth0``, and ``REMOTE_IP`` is
+another IP address on the same subnet, mean that packets sent by the VF should
+be VxLAN encapsulated and sent out the physical port (the driver has to deduce
+this by a route lookup of ``LOCAL_IP`` leading to ``eth0``, and also perform an
+ARP/neighbour table lookup to find the MAC addresses to use in the outer
+Ethernet frame), while UDP packets received on the physical port with UDP port
+4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``, decapsulated
+and forwarded to the VF.
+
+If this all seems complicated, just remember the 'golden rule' of TC offload:
+the hardware should ensure the same final results as if the packets were
+processed through the slow path, traversed software TC and were transmitted or
+received through the representor netdevices.
+
+Configuring the representee's MAC
+---------------------------------
+
+The representee's link state is controlled through the representor.  Setting the
+representor administratively UP or DOWN should cause carrier ON or OFF at the
+representee.
+
+Setting an MTU on the representor should cause that same MTU to be reported to
+the representee.
+(On hardware that allows configuring separate and distinct MTU and MRU values,
+the representor MTU should correspond to the representee's MRU and vice-versa.)
+
+Currently there is no way to use the representor to set the station permanent
+MAC address of the representee; other methods available to do this include:
+
+ - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
+ - devlink port function (see **devlink-port(8)** and
+   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
index f1f4e6a85a29..21e80c8e661b 100644
--- a/Documentation/networking/switchdev.rst
+++ b/Documentation/networking/switchdev.rst
@@ -1,5 +1,6 @@
 .. SPDX-License-Identifier: GPL-2.0
 .. include:: <isonum.txt>
+.. _switchdev:
 
 ===============================================
 Ethernet switch device driver model (switchdev)
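
For illustration only, the ``phys_port_name`` convention described in the
patch above can be sketched in a few lines of Python.  This is a minimal
model of the string format, not kernel or driver code.  The helper names
(``build_phys_port_name``, ``parse_phys_port_name``, ``suggest_rep_name``)
are hypothetical, and the fixed component order p, if, pf, vf, sf is an
assumption made here; the patch only says the name is a concatenation of
one or more of these components.

```python
import re

# Assumed component order: physical port, PCIe controller, PF, VF, subfunction.
FIELD_ORDER = ("p", "if", "pf", "vf", "sf")

# "pf" must be tried before "p" so that "pf1" is not read as "p" + garbage.
_TOKEN = re.compile(r"(pf|vf|sf|if|p)(\d+)")


def build_phys_port_name(parts):
    """Build a name such as 'p0pf1vf2' from {'p': 0, 'pf': 1, 'vf': 2}."""
    unknown = set(parts) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"unknown component(s): {sorted(unknown)}")
    if not parts:
        raise ValueError("at least one component is required")
    pieces = []
    for prefix in FIELD_ORDER:
        if prefix in parts:
            index = parts[prefix]
            if not isinstance(index, int) or index < 0:
                raise ValueError(f"bad index for {prefix!r}: {index!r}")
            pieces.append(f"{prefix}{index}")
    return "".join(pieces)


def parse_phys_port_name(name):
    """Split a name such as 'p0pf1vf2' back into {'p': 0, 'pf': 1, 'vf': 2}."""
    parts = {}
    pos = 0
    while pos < len(name):
        match = _TOKEN.match(name, pos)
        if match is None:
            raise ValueError(f"cannot parse {name!r} at offset {pos}")
        prefix, index = match.group(1), int(match.group(2))
        if prefix in parts:
            raise ValueError(f"component {prefix!r} repeated in {name!r}")
        parts[prefix] = index
        pos = match.end()
    if not parts:
        raise ValueError("empty phys_port_name")
    return parts


def suggest_rep_name(pf_netdev, phys_port_name):
    """Mimic the renaming example: drop the port part, append 'rep'.

    Raises ValueError if nothing is left after dropping the port part,
    i.e. the name describes the port itself rather than a function on it.
    """
    parts = parse_phys_port_name(phys_port_name)
    parts.pop("p", None)
    return f"{pf_netdev}{build_phys_port_name(parts)}rep"
```

With this sketch, ``suggest_rep_name("eth4", "p0pf1vf2")`` yields
``eth4pf1vf2rep``, matching the renaming example in the patch.  In practice
such renaming would be done by udev rules reading the ``phys_port_name``
sysfs node rather than by a script like this.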


* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-05 16:58 [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors ecree
@ 2022-08-05 19:15 ` Randy Dunlap
  2022-08-08 20:48   ` Edward Cree
  2022-08-06  1:43 ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Randy Dunlap @ 2022-08-05 19:15 UTC (permalink / raw)
  To: ecree, netdev
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, Edward Cree,
	linux-net-drivers

Hi--

On 8/5/22 09:58, ecree@xilinx.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
> 
> There's no clear explanation of what VF Representors are for, their
>  semantics, etc., outside of vendor docs and random conference slides.
> Add a document explaining Representors and defining what drivers that
>  implement them are expected to do.
> 
> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
> ---
> diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
> new file mode 100644
> index 000000000000..4d28731a5b5b
> --- /dev/null
> +++ b/Documentation/networking/representors.rst
> @@ -0,0 +1,219 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============================
> +Network Function Representors
> +=============================
> +
> +This document describes the semantics and usage of representor netdevices, as
> +used to control internal switching on SmartNICs.  For the closely-related port
> +representors on physical (multi-port) switches, see
> +:ref:`Documentation/networking/switchdev.rst <switchdev>`.
> +
> +Motivation
> +----------
> +
> +Since the mid-2010s, network cards have started offering more complex
> +virtualisation capabilities than the legacy SR-IOV approach (with its simple
> +MAC/VLAN-based switching model) can support.  This led to a desire to offload
> +software-defined networks (such as OpenVSwitch) to these NICs to specify the
> +network connectivity of each function.  The resulting designs are variously
> +called SmartNICs or DPUs.
> +
> +Network function representors provide the mechanism by which network functions
> +on an internal switch are managed.  They are used both to configure the
> +corresponding function ('representee') and to handle slow-path traffic to and
> +from the representee for which no fast-path switching rule is matched.
> +
> +That is, a representor is both a control plane object (representing the function
> +in administrative commands) and a data plane object (one end of a virtual pipe).
> +As a virtual link endpoint, the representor can be configured like any other
> +netdevice; in some cases (e.g. link state) the representee will follow the
> +representor's configuration, while in others there are separate APIs to
> +configure the representee.
> +
> +What does a representor do?
> +---------------------------
> +
> +A representor has three main rôles.

Just use "roles". dict.org and m-w.com are happy with that.
m-w.com says for "role":
  variants: or less commonly rôle

thanks.
-- 
~Randy


* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-05 16:58 [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors ecree
  2022-08-05 19:15 ` Randy Dunlap
@ 2022-08-06  1:43 ` Jakub Kicinski
  2022-08-08 16:50   ` Keller, Jacob E
  2022-08-08 20:44   ` Edward Cree
  1 sibling, 2 replies; 12+ messages in thread
From: Jakub Kicinski @ 2022-08-06  1:43 UTC (permalink / raw)
  To: ecree
  Cc: netdev, davem, pabeni, edumazet, corbet, linux-doc, Edward Cree,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck

On Fri, 5 Aug 2022 17:58:50 +0100 ecree@xilinx.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
> 
> There's no clear explanation of what VF Representors are for, their
>  semantics, etc., outside of vendor docs and random conference slides.
> Add a document explaining Representors and defining what drivers that
>  implement them are expected to do.
> 
> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
> ---
> This documents representors as I understand them, but I suspect others
>  (including other vendors) might disagree (particularly with the "what
>  functions should have a rep" section).  I'm hoping that through review
>  of this doc we can converge on a consensus.

Thanks for doing this, we need to CC people tho. Otherwise they won't
pay attention. (adding semi-non-exhaustively those I have in my address
book)

> +=============================
> +Network Function Representors
> +=============================
> +
> +This document describes the semantics and usage of representor netdevices, as
> +used to control internal switching on SmartNICs.  For the closely-related port
> +representors on physical (multi-port) switches, see
> +:ref:`Documentation/networking/switchdev.rst <switchdev>`.
> +
> +Motivation
> +----------
> +
> +Since the mid-2010s, network cards have started offering more complex
> +virtualisation capabilities than the legacy SR-IOV approach (with its simple
> +MAC/VLAN-based switching model) can support.  This led to a desire to offload
> +software-defined networks (such as OpenVSwitch) to these NICs to specify the
> +network connectivity of each function.  The resulting designs are variously
> +called SmartNICs or DPUs.
> +
> +Network function representors provide the mechanism by which network functions
> +on an internal switch are managed. They are used both to configure the
> +corresponding function ('representee') and to handle slow-path traffic to and
> +from the representee for which no fast-path switching rule is matched.

I think we should just describe how those netdevs bring SR-IOV
forwarding into Linux networking stack. This section reads too much
like it's a hack rather than an obvious choice. Perhaps:

The representors bring the standard Linux networking stack to IOV
functions. Just as each port of a Linux-controlled switch has a
separate netdev, each virtual function has one. When the system boots,
and before any offload is configured, all packets from the virtual
functions appear in the networking stack of the PF via the representors.
The PF can thus always communicate freely with the virtual functions.
The PF can configure standard Linux forwarding between representors,
the uplink or any other netdev (routing, bridging, TC classifiers).

> +That is, a representor is both a control plane object (representing the function
> +in administrative commands) and a data plane object (one end of a virtual pipe).
> +As a virtual link endpoint, the representor can be configured like any other
> +netdevice; in some cases (e.g. link state) the representee will follow the
> +representor's configuration, while in others there are separate APIs to
> +configure the representee.
> +
> +What does a representor do?
> +---------------------------
> +
> +A representor has three main rôles.
> +
> +1. It is used to configure the representee's virtual MAC, e.g. link up/down,
> +   MTU, etc.  For instance, bringing the representor administratively UP should
> +   cause the representee to see a link up / carrier on event.

I presume you're trying to start a discussion here, rather than stating
the existing behavior. Or the "virtual MAC" means something else than I
think it means?

> +2. It provides the slow path for traffic which does not hit any offloaded
> +   fast-path rules in the virtual switch.  Packets transmitted on the
> +   representor netdevice should be delivered to the representee; packets
> +   transmitted to the representee which fail to match any switching rule should
> +   be received on the representor netdevice.  (That is, there is a virtual pipe
> +   connecting the representor to the representee, similar in concept to a veth
> +   pair.)
> +
> +   This allows software switch implementations (such as OpenVSwitch or a Linux
> +   bridge) to forward packets between representees and the rest of the network.
> +3. It acts as a handle by which switching rules (such as TC filters) can refer
> +   to the representee, allowing these rules to be offloaded.
> +
> +The combination of 2) and 3) means that the behaviour (apart from performance)
> +should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
> +on a VF representor applies in software to packets received on that representor
> +netdevice, while in hardware offload it would apply to packets transmitted by
> +the representee VF.  Conversely, a mirred egress redirect to a VF representor
> +corresponds in hardware to delivery directly to the representee VF.
> +
> +What functions should have a representor?
> +-----------------------------------------
> +
> +Essentially, for each virtual port on the device's internal switch, there
> +should be a representor.
> +The only exceptions are the management PF (whose port is used for traffic to
> +and from all other representors) 

AFAIK there's no "management PF" in the Linux model.

> and perhaps the physical network port (for
> +which the management PF may act as a kind of port representor).  Devices that
> +combine multiple physical ports and SR-IOV capability may need to have port
> +representors in addition to PF/VF representors.

That doesn't generalize well. If we just say that all uplinks and PFs
should have a repr we don't have to make exceptions for all the cases
where that's the case.

> +Thus, the following should all have representors:
> +
> + - VFs belonging to the management PF.

management PF -> /dev/null

> + - Other PFs on the PCIe controller, and any VFs belonging to them.

What is "the PCIe controller" here? I presume you've seen the
devlink-port doc.

> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
> +   System-on-Chip within the SmartNIC).
> + - PFs and VFs with other personalities, including network block devices (such
> +   as a vDPA virtio-blk PF backed by remote/distributed storage).

IDK how you can configure block forwarding (which is DMAs of command
+ data blocks, not packets AFAIU) with the networking concepts..
I've not used the storage functions tho, so I could be wrong.

> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
> +   their own port on the switch (as opposed to using their parent PF's port).
> + - Any accelerators or plugins on the device whose interface to the network is
> +   through a virtual switch port, even if they do not have a corresponding PCIe
> +   PF or VF.
> +
> +This allows the entire switching behaviour of the NIC to be controlled through
> +representor TC rules.
> +
> +An example of a PCIe function that should *not* have a representor is, on an
> +FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
> +and which cannot create RX and TX queues.

What's the thinking here? We're letting everyone add their own
exceptions to the doc?

>  Since such a PF does not have network
> +access through the internal switch, not even indirectly via a distributed
> +storage endpoint, there is no switch virtual port for the representor to
> +configure or to be the other end of the virtual pipe.

Does it have a netdev?

> +How are representors created?
> +-----------------------------
> +
> +The driver instance attached to the management PF should enumerate the virtual
> +ports on the switch, and for each representee, create a pure-software netdevice
> +which has some form of in-kernel reference to the PF's own netdevice or driver
> +private data (``netdev_priv()``).
> +If switch ports can dynamically appear/disappear, the PF driver should create
> +and destroy representors appropriately.
> +The operations of the representor netdevice will generally involve acting
> +through the management PF.  For example, ``ndo_start_xmit()`` might send the
> +packet, specially marked for delivery to the representee, through a TX queue
> +attached to the management PF.

IDK how common that is, RDMA NICs will likely do the "dedicated queue
per repr" thing since they pretend to have infinite queues.

> +How are representors identified?
> +--------------------------------
> +
> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> +representee or of the management PF.

Do we know how many existing ones do? 

> +Instead, it should implement the ``ndo_get_port_parent_id()`` and
> +``ndo_get_phys_port_name()`` netdevice ops (corresponding to the
> +``phys_switch_id`` and ``phys_port_name`` sysfs nodes).
> +``ndo_get_port_parent_id()`` should return a string identical to that returned
> +by the management PF's ``ndo_get_phys_port_id()`` (typically the MAC address of
> +the physical port), while ``ndo_get_phys_port_name()`` should return a string
> +describing the representee's relation to the management PF.
> +
> +For instance, if the management PF has a ``phys_port_name`` of ``p0`` (physical
> +port 0), then the representor for the third VF on the second PF should typically
> +be ``p0pf1vf2`` (i.e. "port 0, PF 1, VF 2").  More generally, the
> +``phys_port_name`` for a PCIe function should be the concatenation of one or
> +more of:
> +
> + - ``p<N>``, physical port number *N*.
> + - ``if<N>``, PCIe controller number *N*.  The semantics of these numbers are
> +   vendor-defined, and controller 0 need not correspond to the controller on
> +   which the management PF resides.

/me checks in horror if this is already upstream

> + - ``pf<N>``, PCIe physical function index *N*.
> + - ``vf<N>``, PCIe virtual function index *N*.
> + - ``sf<N>``, Subfunction index *N*.

Yeah, nah... implement devlink port, please. This is done by the core,
you shouldn't have to document this.

> +It is expected that userland will use this information (e.g. through udev rules)
> +to construct an appropriately informative name or alias for the netdevice.  For
> +instance if the management PF is ``eth4`` then our representor with a
> +``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
> +
> +There are as yet no established conventions for naming representors which do not
> +correspond to PCIe functions (e.g. accelerators and plugins).
> +
> +How do representors interact with TC rules?
> +-------------------------------------------
> +
> +Any TC rule on a representor applies (in software TC) to packets received by
> +that representor netdevice.  Thus, if the delivery part of the rule corresponds
> +to another port on the virtual switch, the driver may choose to offload it to
> +hardware, applying it to packets transmitted by the representee.
> +
> +Similarly, since a TC mirred egress action targeting the representor would (in
> +software) send the packet through the representor (and thus indirectly deliver
> +it to the representee), hardware offload should interpret this as delivery to
> +the representee.
> +
> +As a simple example, if ``eth0`` is the management PF's netdevice and ``eth1``
> +is a VF representor, the following rules::
> +
> +    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
> +        action mirred egress redirect dev eth0
> +    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
> +        action mirred egress mirror dev eth1
> +
> +would mean that all IPv4 packets from the VF are sent out the physical port, and
> +all IPv4 packets received on the physical port are delivered to the VF in
> +addition to the management PF.
> +
> +Of course the rules can (if supported by the NIC) include packet-modifying
> +actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
> +
> +Tunnel encapsulation and decapsulation are rather more complicated, as they
> +involve a third netdevice (a tunnel netdev operating in metadata mode, such as
> +a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
> +require an IP address to be bound to the underlay device (e.g. management PF or
> +port representor).  TC rules such as::
> +
> +    tc filter add dev eth1 parent ffff: flower \
> +        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
> +                              dst_port 4789 \
> +        action mirred egress redirect dev vxlan0
> +    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
> +        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
> +        action tunnel_key unset action mirred egress redirect dev eth1
> +
> +where ``LOCAL_IP`` is an IP address bound to ``eth0``, and ``REMOTE_IP`` is
> +another IP address on the same subnet, mean that packets sent by the VF should
> +be VxLAN encapsulated and sent out the physical port (the driver has to deduce
> +this by a route lookup of ``LOCAL_IP`` leading to ``eth0``, and also perform an
> +ARP/neighbour table lookup to find the MAC addresses to use in the outer
> +Ethernet frame), while UDP packets received on the physical port with UDP port
> +4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``, decapsulated
> +and forwarded to the VF.
> +
> +If this all seems complicated, just remember the 'golden rule' of TC offload:
> +the hardware should ensure the same final results as if the packets were
> +processed through the slow path, traversed software TC and were transmitted or
> +received through the representor netdevices.
> +
> +Configuring the representee's MAC
> +---------------------------------
> +
> +The representee's link state is controlled through the representor.  Setting the
> +representor administratively UP or DOWN should cause carrier ON or OFF at the
> +representee.
> +
> +Setting an MTU on the representor should cause that same MTU to be reported to
> +the representee.
> +(On hardware that allows configuring separate and distinct MTU and MRU values,
> +the representor MTU should correspond to the representee's MRU and vice-versa.)

Why worry about that?

> +Currently there is no way to use the representor to set the station permanent
> +MAC address of the representee; other methods available to do this include:
> +
> + - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
> + - devlink port function (see **devlink-port(8)** and
> +   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
> diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
> index f1f4e6a85a29..21e80c8e661b 100644
> --- a/Documentation/networking/switchdev.rst
> +++ b/Documentation/networking/switchdev.rst
> @@ -1,5 +1,6 @@
>  .. SPDX-License-Identifier: GPL-2.0
>  .. include:: <isonum.txt>
> +.. _switchdev:
>  
>  ===============================================
>  Ethernet switch device driver model (switchdev)


* RE: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-06  1:43 ` Jakub Kicinski
@ 2022-08-08 16:50   ` Keller, Jacob E
  2022-08-08 20:44   ` Edward Cree
  1 sibling, 0 replies; 12+ messages in thread
From: Keller, Jacob E @ 2022-08-08 16:50 UTC (permalink / raw)
  To: Jakub Kicinski, ecree
  Cc: netdev, davem, pabeni, edumazet, corbet, linux-doc, Edward Cree,
	linux-net-drivers, Brandeburg, Jesse, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck



> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, August 05, 2022 6:44 PM
> To: ecree@xilinx.com
> Cc: netdev@vger.kernel.org; davem@davemloft.net; pabeni@redhat.com;
> edumazet@google.com; corbet@lwn.net; linux-doc@vger.kernel.org; Edward
> Cree <ecree.xilinx@gmail.com>; linux-net-drivers@amd.com; Keller, Jacob E
> <jacob.e.keller@intel.com>; Brandeburg, Jesse <jesse.brandeburg@intel.com>;
> Michael Chan <michael.chan@broadcom.com>; Andy Gospodarek
> <andy@greyhouse.net>; Saeed Mahameed <saeed@kernel.org>; Jiri Pirko
> <jiri@resnulli.us>; Shannon Nelson <snelson@pensando.io>; Simon Horman
> <simon.horman@corigine.com>; Alexander Duyck
> <alexander.duyck@gmail.com>
> Subject: Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other)
> Representors
> 
> On Fri, 5 Aug 2022 17:58:50 +0100 ecree@xilinx.com wrote:
> > From: Edward Cree <ecree.xilinx@gmail.com>
> >
> > There's no clear explanation of what VF Representors are for, their
> >  semantics, etc., outside of vendor docs and random conference slides.
> > Add a document explaining Representors and defining what drivers that
> >  implement them are expected to do.
> >
> > Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
> > ---
> > This documents representors as I understand them, but I suspect others
> >  (including other vendors) might disagree (particularly with the "what
> >  functions should have a rep" section).  I'm hoping that through review
> >  of this doc we can converge on a consensus.
> 
> Thanks for doing this, we need to CC people tho. Otherwise they won't
> pay attention. (adding semi-non-exhaustively those I have in my address
> book)
> 
> > +=============================
> > +Network Function Representors
> > +=============================
> > +
> > +This document describes the semantics and usage of representor netdevices, as
> > +used to control internal switching on SmartNICs.  For the closely-related port
> > +representors on physical (multi-port) switches, see
> > +:ref:`Documentation/networking/switchdev.rst <switchdev>`.
> > +
> > +Motivation
> > +----------
> > +
> > +Since the mid-2010s, network cards have started offering more complex
> > +virtualisation capabilities than the legacy SR-IOV approach (with its simple
> > +MAC/VLAN-based switching model) can support.  This led to a desire to offload
> > +software-defined networks (such as OpenVSwitch) to these NICs to specify the
> > +network connectivity of each function.  The resulting designs are variously
> > +called SmartNICs or DPUs.
> > +
> > +Network function representors provide the mechanism by which network functions
> > +on an internal switch are managed. They are used both to configure the
> > +corresponding function ('representee') and to handle slow-path traffic to and
> > +from the representee for which no fast-path switching rule is matched.
> 
> I think we should just describe how those netdevs bring SR-IOV
> forwarding into Linux networking stack. This section reads too much
> like it's a hack rather than an obvious choice.

I agree. Not every device can support it, but representor devices and
switchdev can be supported even on hardware that is not as fully featured
or capable as a "SmartNIC". Of course, the terminology here can get muddled
with various branding, etc.

* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-06  1:43 ` Jakub Kicinski
  2022-08-08 16:50   ` Keller, Jacob E
@ 2022-08-08 20:44   ` Edward Cree
  2022-08-09  3:41     ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Edward Cree @ 2022-08-08 20:44 UTC (permalink / raw)
  To: Jakub Kicinski, ecree
  Cc: netdev, davem, pabeni, edumazet, corbet, linux-doc,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck

On 06/08/2022 02:43, Jakub Kicinski wrote:
> On Fri, 5 Aug 2022 17:58:50 +0100 ecree@xilinx.com wrote:
>> +Network function representors provide the mechanism by which network functions
>> +on an internal switch are managed. They are used both to configure the
>> +corresponding function ('representee') and to handle slow-path traffic to and
>> +from the representee for which no fast-path switching rule is matched.
> 
> I think we should just describe how those netdevs bring SR-IOV
> forwarding into Linux networking stack. This section reads too much
> like it's a hack rather than an obvious choice. Perhaps:
> 
> The representors bring the standard Linux networking stack to IOV
> functions. Same as each port of a Linux-controlled switch has a
> separate netdev, each virtual function has one. When system boots 
> and before any offload is configured all packets from the virtual
> functions appear in the networking stack of the PF via the representors.
> PF can thus always communicate freely with the virtual functions. 
> PF can configure standard Linux forwarding between representors, 
> the uplink or any other netdev (routing, bridging, TC classifiers).

Makes sense, yes.

>> +1. It is used to configure the representee's virtual MAC, e.g. link up/down,
>> +   MTU, etc.  For instance, bringing the representor administratively UP should
>> +   cause the representee to see a link up / carrier on event.
> 
> I presume you're trying to start a discussion here, rather than stating
> the existing behavior. Or the "virtual MAC" means something else than I
> think it means?

Virtual MAC in the sense that the VF (or whatever) is presented with the
 illusion that it owns a MAC on the wire whereas really it's just a port
 on a virtual switch.  So it has a link state, MAC address, MAC stats,
 probably other things I've forgotten about.
Link state following the VF rep is from [1], I just assumed that mlx had
 in fact implemented that.  (I haven't implemented it for sfc ef100.)
 Hopefully Mellanox folks can clarify what they see as the current design?
I was trying to describe the concept behind that in a more general way;
 my intuition is that this model can be applied to more than just the
 link state.  (Though we saw what a mess I made the first time I tried to
 apply this to the MAC address...)

>> +What functions should have a representor?
>> +-----------------------------------------
>> +
>> +Essentially, for each virtual port on the device's internal switch, there
>> +should be a representor.
>> +The only exceptions are the management PF (whose port is used for traffic to
>> +and from all other representors) 
> 
> AFAIK there's no "management PF" in the Linux model.

Maybe a bad word choice.  I'm referring to whichever PF (which likely
 also has an ordinary netdevice) has administrative rights over the NIC /
 internal switch at a firmware level.  Other names I've seen tossed
 around include "primary PF", "admin PF".

>> and perhaps the physical network port (for
>> +which the management PF may act as a kind of port representor.  Devices that
>> +combine multiple physical ports and SR-IOV capability may need to have port
>> +representors in addition to PF/VF representors).
> 
> That doesn't generalize well. If we just say that all uplinks and PFs
> should have a repr we don't have to make exceptions for all the cases
> where that's the case.

We could, but AFAIK that's not how existing drivers behave.  At least
 when I experimented with a mlx NIC a couple of years ago I don't
 recall it creating a repr for the primary PF or for the physical port,
 only reprs for the VFs.

>> + - Other PFs on the PCIe controller, and any VFs belonging to them.
> 
> What is "the PCIe controller" here? I presume you've seen the
> devlink-port doc.

Yes, that's where I got this terminology from.
"the" PCIe controller here is the one on which the mgmt PF lives.  For
 instance you might have a NIC where you run OVS on a SoC inside the
 chip, that has its own PCIe controller including a PF it uses to drive
 the hardware v-switch (so it can offload OVS rules), in addition to
 the PCIe controller that exposes PFs & VFs to the host you plug it
 into through the physical PCIe socket / edge connector.
In that case this bullet would refer to any additional PFs the SoC has
 besides the management one...

>> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
>> +   System-on-Chip within the SmartNIC).

... and this bullet to the PFs the host sees.

>> + - PFs and VFs with other personalities, including network block devices (such
>> +   as a vDPA virtio-blk PF backed by remote/distributed storage).
> 
> IDK how you can configure block forwarding (which is DMAs of command
> + data blocks, not packets AFAIU) with the networking concepts..
> I've not used the storage functions tho, so I could be wrong.

Maybe I'm way off the beam here, but my understanding is that this
 sort of thing involves a block interface between the host and the
 NIC, but then something internal to the NIC converts those
 operations into network operations (e.g. RDMA traffic or Ceph TCP
 packets), which then go out on the network to access the actual
 data.  In that case the back-end has to have network connectivity,
 and the obvious™ way to do that is give it a v-port on the v-switch
 just like anyone else.

>> +An example of a PCIe function that should *not* have a representor is, on an
>> +FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
>> +and which cannot create RX and TX queues.
> 
> What's the thinking here? We're letting everyone add their own
> exceptions to the doc?

It was just the only example I could come up with of the more general
 rule: if it doesn't have the ability to send and receive packets over
 the network (directly or indirectly), then it won't have a virtual
 port on the virtual switch, and so it doesn't make sense for it to
 have a representor.
No way to TX = nothing will ever be RXed on the rep; no way to RX = no
 way to deliver anything you TX from the rep.  And nothing for TC
 offload to act upon either for the same reasons.

>>  Since such a PF does not have network
>> +access through the internal switch, not even indirectly via a distributed
>> +storage endpoint, there is no switch virtual port for the representor to
>> +configure or to be the other end of the virtual pipe.
> 
> Does it have a netdev?

No.  But per the bit about block devices above, that's not a sufficient
 condition; PFs that terminate network traffic inside the hardware to
 implement some other functionality can have a v-switch port (and thus
 need a repr) despite not having a netdev either.
(Also, you *could* have a PF with a netdev that only talks to some kind
 of NOC and isn't connected to the v-switch, in which case that PF
 *wouldn't* have a repr.  But that seems sufficiently perverse that I
 didn't think it worth mentioning in the doc.)

>> For example, ``ndo_start_xmit()`` might send the
>> +packet, specially marked for delivery to the representee, through a TX queue
>> +attached to the management PF.
> 
> IDK how common that is, RDMA NICs will likely do the "dedicated queue
> per repr" thing since they pretend to have infinite queues.

Right.  But the queue is still created by the driver bound to the mgmt
 PF, and using that PF for whatever BAR accesses it uses to create and
 administer the queue, no?
That's the important bit, and the details of how the NIC knows which
 representee to deliver it to (dedicated queue, special TX descriptor,
 whatever) are vendor-specific magic.  Better ways of phrasing that
 are welcome :)

>> +How are representors identified?
>> +--------------------------------
>> +
>> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
>> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
>> +representee or of the management PF.
> 
> Do we know how many existing ones do? 

Idk.  From a quick look on lxr, mlx5 and ice do; as far as I can tell
 nfp/flower does for "phy_reprs" but not "vnic_reprs".  nfp/abm does.

My reasoning for this "should not" here is that a repr is a pure
 software device; compare e.g. if you build a vlan netdev on top of
 eth0 it doesn't inherit eth0's device.
Also, at least in theory this should avoid the problem with OpenStack
 picking the wrong netdevice that you mentioned in [2], as this is
 what controls the 'device' symlink in sysfs.

>> + - ``pf<N>``, PCIe physical function index *N*.
>> + - ``vf<N>``, PCIe virtual function index *N*.
>> + - ``sf<N>``, Subfunction index *N*.
> 
> Yeah, nah... implement devlink port, please. This is done by the core,
> you shouldn't have to document this.

Oh huh, I didn't know about __devlink_port_phys_port_name_get().
Last time I looked, the drivers all had their own
 .ndo_get_phys_port_name implementations (which is why I did one for
 sfc), and any similarity between their string formats was purely an
 (undocumented) convention.  TIL!
(And it looks like the core uses `c<N>` for my `if<N>` that you were
 so horrified by.  Devlink-port documentation doesn't make it super
 clear whether controller 0 is "the controller that's in charge" or
 "the controller from which we're viewing things", though I think in
 practice it comes to the same thing.)
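
For reference, a rough userspace sketch of the naming shape as I understand
it.  This is an illustration under assumptions, not the devlink core code:
the types and fn_phys_port_name() below are hypothetical, and the premise
that the `c<N>` prefix is only emitted for functions on an external
controller should be checked against the core before relying on it.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical function kinds, for illustration only. */
enum fn_kind { FN_PF, FN_VF, FN_SF };

/* Hypothetical identity of a representee. */
struct fn_id {
	enum fn_kind kind;
	bool external;			/* lives on a non-local controller */
	unsigned int controller;	/* controller index */
	unsigned int pf;		/* PF index */
	unsigned int idx;		/* VF or SF index; unused for PFs */
};

/*
 * Build a phys_port_name-style string such as "pf0vf3" or "c1pf2".
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int fn_phys_port_name(const struct fn_id *id, char *buf, size_t len)
{
	int n = 0, m;

	if (id->external) {
		/* Assumed: controller prefix only for external functions. */
		n = snprintf(buf, len, "c%u", id->controller);
		if (n < 0 || (size_t)n >= len)
			return -1;
	}

	switch (id->kind) {
	case FN_PF:
		m = snprintf(buf + n, len - n, "pf%u", id->pf);
		break;
	case FN_VF:
		m = snprintf(buf + n, len - n, "pf%uvf%u", id->pf, id->idx);
		break;
	case FN_SF:
		m = snprintf(buf + n, len - n, "pf%usf%u", id->pf, id->idx);
		break;
	default:
		return -1;
	}
	if (m < 0)
		return -1;
	n += m;

	/* snprintf() reports the untruncated length, so this detects overflow. */
	return (size_t)n >= len ? -1 : 0;
}
```

With this shape, legacy userland that only parses the string cannot tell a
local controller 0 from "no controller prefix at all", which is the gap I
was getting at.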

>> +Setting an MTU on the representor should cause that same MTU to be reported to
>> +the representee.
>> +(On hardware that allows configuring separate and distinct MTU and MRU values,
>> +the representor MTU should correspond to the representee's MRU and vice-versa.)
> 
> Why worry about that?

I just wanted to make clear that because the representor and
 representee are opposite ends of a virtual link, the latter is seen
 'mirrored' for these kinds of configuration operations.
(This makes sense because e.g. the representor's MRU is the largest
 packet that the representee can send and still have it delivered to
 the representor, thus it's the representee's MTU at least for the
 slow path.)
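
As a rough sketch of that mirroring (illustrative only; struct link_end and
representee_view() are hypothetical names, not part of any driver or of the
proposed doc):

```c
/* Hypothetical per-endpoint size limits, for illustration only. */
struct link_end {
	unsigned int mtu;	/* largest packet this end may transmit */
	unsigned int mru;	/* largest packet this end can receive */
};

/*
 * Derive the representee's limits from the representor's configuration.
 * The two are opposite ends of one virtual link, so what one end can
 * receive bounds what the other end may send, and vice versa.
 */
static struct link_end representee_view(struct link_end rep)
{
	struct link_end vf = {
		.mtu = rep.mru,	/* VF may send what the rep can receive */
		.mru = rep.mtu,	/* VF must accept what the rep may send */
	};

	return vf;
}
```

On hardware with a single MTU knob the two fields are simply equal, and the
swap collapses to "same MTU reported to the representee".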

Thanks for the thorough review!
I've already learned some things, which was part of my objective in
 writing up this doc ;-)
-ed

[1]: https://legacy.netdevconf.info/1.2/slides/oct6/04_gerlitz_efraim_introduction_to_switchdev_sriov_offloads.pdf
[2]: https://lore.kernel.org/all/20220728113231.26fdfab0@kernel.org/

* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-05 19:15 ` Randy Dunlap
@ 2022-08-08 20:48   ` Edward Cree
  0 siblings, 0 replies; 12+ messages in thread
From: Edward Cree @ 2022-08-08 20:48 UTC (permalink / raw)
  To: Randy Dunlap, ecree, netdev
  Cc: davem, kuba, pabeni, edumazet, corbet, linux-doc, linux-net-drivers

On 05/08/2022 20:15, Randy Dunlap wrote:
> On 8/5/22 09:58, ecree@xilinx.com wrote:
>> +A representor has three main rôles.
> 
> Just use "roles". dict.org and m-w.com are happy with that.
> m-w.com says for "role":
>   variants: or less commonly rôle

Okay, will do.
-ed

* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-08 20:44   ` Edward Cree
@ 2022-08-09  3:41     ` Jakub Kicinski
  2022-08-10 16:02       ` Edward Cree
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2022-08-09  3:41 UTC (permalink / raw)
  To: Edward Cree
  Cc: ecree, netdev, davem, pabeni, edumazet, corbet, linux-doc,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck

On Mon, 8 Aug 2022 21:44:45 +0100 Edward Cree wrote:
> >> +What functions should have a representor?
> >> +-----------------------------------------
> >> +
> >> +Essentially, for each virtual port on the device's internal switch, there
> >> +should be a representor.
> >> +The only exceptions are the management PF (whose port is used for traffic to
> >> +and from all other representors)   
> > 
> > AFAIK there's no "management PF" in the Linux model.  
> 
> Maybe a bad word choice.  I'm referring to whichever PF (which likely
>  also has an ordinary netdevice) has administrative rights over the NIC /
>  internal switch at a firmware level.  Other names I've seen tossed
>  around include "primary PF", "admin PF".

I believe someone (mellanox?) used the term eswitch manager.
I'd use "host PF", somehow that makes most sense to me.

> >> and perhaps the physical network port (for
> >> +which the management PF may act as a kind of port representor.  Devices that
> >> +combine multiple physical ports and SR-IOV capability may need to have port
> >> +representors in addition to PF/VF representors).  
> > 
> > That doesn't generalize well. If we just say that all uplinks and PFs
> > should have a repr we don't have to make exceptions for all the cases
> > where that's the case.  
> 
> We could, but AFAIK that's not how existing drivers behave.  At least
>  when I experimented with a mlx NIC a couple of years ago I don't
>  recall it creating a repr for the primary PF or for the physical port,
>  only reprs for the VFs.

Mellanox is not the best example, I think they don't even support
uplink to uplink forwarding cleanly.

> >> + - Other PFs on the PCIe controller, and any VFs belonging to them.  
> > 
> > What is "the PCIe controller" here? I presume you've seen the
> > devlink-port doc.  
> 
> Yes, that's where I got this terminology from.
> "the" PCIe controller here is the one on which the mgmt PF lives.  For
>  instance you might have a NIC where you run OVS on a SoC inside the
>  chip, that has its own PCIe controller including a PF it uses to drive
>  the hardware v-switch (so it can offload OVS rules), in addition to
>  the PCIe controller that exposes PFs & VFs to the host you plug it
>  into through the physical PCIe socket / edge connector.
> In that case this bullet would refer to any additional PFs the SoC has
>  besides the management one...

IMO the model where there's an overall controller for the entire device
is also a mellanox limitation, due to lack of support for nested
switches.

Say I pay for a bare metal instance in my favorite public cloud.
Why would the forwarding between VFs I spawn be controlled by the cloud
provider and not me?!

But perhaps Netronome was the only vendor capable of nested switching.

> >> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
> >> +   System-on-Chip within the SmartNIC).  
> 
> ... and this bullet to the PFs the host sees.
> 
> >> + - PFs and VFs with other personalities, including network block devices (such
> >> +   as a vDPA virtio-blk PF backed by remote/distributed storage).  
> > 
> > IDK how you can configure block forwarding (which is DMAs of command
> > + data blocks, not packets AFAIU) with the networking concepts..
> > I've not used the storage functions tho, so I could be wrong.  
> 
> Maybe I'm way off the beam here, but my understanding is that this
>  sort of thing involves a block interface between the host and the
>  NIC, but then something internal to the NIC converts those
>  operations into network operations (e.g. RDMA traffic or Ceph TCP
>  packets), which then go out on the network to access the actual
>  data.  In that case the back-end has to have network connectivity,
>  and the obvious™ way to do that is give it a v-port on the v-switch
>  just like anyone else.

I see. I don't think this covers all implementations. 

> >> +An example of a PCIe function that should *not* have a representor is, on an
> >> +FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
> >> +and which cannot create RX and TX queues.  
> > 
> > What's the thinking here? We're letting everyone add their own
> > exceptions to the doc?  
> 
> It was just the only example I could come up with of the more general
>  rule: if it doesn't have the ability to send and receive packets over
>  the network (directly or indirectly), then it won't have a virtual
>  port on the virtual switch, and so it doesn't make sense for it to
>  have a representor.
> No way to TX = nothing will ever be RXed on the rep; no way to RX = no
>  way to deliver anything you TX from the rep.  And nothing for TC
>  offload to act upon either for the same reasons.

No need to mention that, I'd think. Seems obvious.

> >> For example, ``ndo_start_xmit()`` might send the
> >> +packet, specially marked for delivery to the representee, through a TX queue
> >> +attached to the management PF.  
> > 
> > IDK how common that is, RDMA NICs will likely do the "dedicated queue
> > per repr" thing since they pretend to have infinite queues.  
> 
> Right.  But the queue is still created by the driver bound to the mgmt
>  PF, and using that PF for whatever BAR accesses it uses to create and
>  administer the queue, no?
> That's the important bit, and the details of how the NIC knows which
>  representee to deliver it to (dedicated queue, special TX descriptor,
>  whatever) are vendor-specific magic.  Better ways of phrasing that
>  are welcome :)

"TX queue attached to" made me think of a netdev Tx queue with a qdisc
rather than just a HW queue. No better ideas tho.

> >> +How are representors identified?
> >> +--------------------------------
> >> +
> >> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> >> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> >> +representee or of the management PF.  
> > 
> > Do we know how many existing ones do?   
> 
> Idk.  From a quick look on lxr, mlx5 and ice do; as far as I can tell
>  nfp/flower does for "phy_reprs" but not "vnic_reprs".  nfp/abm does.
> 
> My reasoning for this "should not" here is that a repr is a pure
>  software device; compare e.g. if you build a vlan netdev on top of
>  eth0 it doesn't inherit eth0's device.
> Also, at least in theory this should avoid the problem with OpenStack
>  picking the wrong netdevice that you mentioned in [2], as this is
>  what controls the 'device' symlink in sysfs.

It makes sense. The thought I had was "what if a user reads this and
assumes it's never the case". But to be fair "should not" != "must not"
so we're probably good with your wording as is.

> >> + - ``pf<N>``, PCIe physical function index *N*.
> >> + - ``vf<N>``, PCIe virtual function index *N*.
> >> + - ``sf<N>``, Subfunction index *N*.  
> > 
> > Yeah, nah... implement devlink port, please. This is done by the core,
> > you shouldn't have to document this.  
> 
> Oh huh, I didn't know about __devlink_port_phys_port_name_get().
> Last time I looked, the drivers all had their own
>  .ndo_get_phys_port_name implementations (which is why I did one for
>  sfc), and any similarity between their string formats was purely an
>  (undocumented) convention.  TIL!
> (And it looks like the core uses `c<N>` for my `if<N>` that you were
>  so horrified by.  Devlink-port documentation doesn't make it super
>  clear whether controller 0 is "the controller that's in charge" or
>  "the controller from which we're viewing things", though I think in
>  practice it comes to the same thing.)

I think we had a bit. Perhaps @external? The controller which doesn't
have @external == true should be the local one IIRC. And by extension
presumably in charge.

* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-09  3:41     ` Jakub Kicinski
@ 2022-08-10 16:02       ` Edward Cree
  2022-08-10 17:58         ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Edward Cree @ 2022-08-10 16:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: ecree, netdev, davem, pabeni, edumazet, corbet, linux-doc,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck

On 09/08/2022 04:41, Jakub Kicinski wrote:
>>> AFAIK there's no "management PF" in the Linux model.  
>>
>> Maybe a bad word choice.  I'm referring to whichever PF (which likely
>>  also has an ordinary netdevice) has administrative rights over the NIC /
>>  internal switch at a firmware level.  Other names I've seen tossed
>>  around include "primary PF", "admin PF".
> 
> I believe someone (mellanox?) used the term eswitch manager.
> I'd use "host PF", somehow that makes most sense to me.

Not sure about that, I've seen "host" used as antonym of "SoC", so
 if the device is configured with the SoC as the admin this could
 confuse people.
I think whatever term we settle on, this document might need to
 have a 'Definitions' section to make it clear :S

>>> What is "the PCIe controller" here? I presume you've seen the
>>> devlink-port doc.  
>>
>> Yes, that's where I got this terminology from.
>> "the" PCIe controller here is the one on which the mgmt PF lives.  For
>>  instance you might have a NIC where you run OVS on a SoC inside the
>>  chip, that has its own PCIe controller including a PF it uses to drive
>>  the hardware v-switch (so it can offload OVS rules), in addition to
>>  the PCIe controller that exposes PFs & VFs to the host you plug it
>>  into through the physical PCIe socket / edge connector.
>> In that case this bullet would refer to any additional PFs the SoC has
>>  besides the management one...
> 
> IMO the model where there's an overall controller for the entire device
> is also a mellanox limitation, due to lack of support for nested
> switches.

Instead of "the PCIe controller" I should probably say "the local PCIe
 controller", since that's the wording the devlink-port doc uses.

> Say I pay for a bare metal instance in my favorite public cloud.
> Why would the forwarding between VFs I spawn be controlled by the cloud
> provider and not me?!
> 
> But perhaps Netronome was the only vendor capable of nested switching.

Quite possibly.  Current EF100 NICs can't do nested switching either.

>>>> + - PFs and VFs with other personalities, including network block devices (such
>>>> +   as a vDPA virtio-blk PF backed by remote/distributed storage).  
>>>
>>> IDK how you can configure block forwarding (which is DMAs of command
>>> + data blocks, not packets AFAIU) with the networking concepts..
>>> I've not used the storage functions tho, so I could be wrong.  
>>
>> Maybe I'm way off the beam here, but my understanding is that this
>>  sort of thing involves a block interface between the host and the
>>  NIC, but then something internal to the NIC converts those
>>  operations into network operations (e.g. RDMA traffic or Ceph TCP
>>  packets), which then go out on the network to access the actual
>>  data.  In that case the back-end has to have network connectivity,
>>  and the obvious™ way to do that is give it a v-port on the v-switch
>>  just like anyone else.
> 
> I see. I don't think this covers all implementations. 

Right, I should probably make it more clear that this isn't the only
 way it could be done.
I'm merely trying to make clear that things that don't look like
 netdevices might still have a v-port and hence need a repr.

> "TX queue attached to" made me think of a netdev Tx queue with a qdisc
> rather than just a HW queue. No better ideas tho.

Would adding the word "hardware" before "TX queue" help?  Have to
 admit the netdev-queue interpretation hadn't occurred to me.

>> (And it looks like the core uses `c<N>` for my `if<N>` that you were
>>  so horrified by.  Devlink-port documentation doesn't make it super
>>  clear whether controller 0 is "the controller that's in charge" or
>>  "the controller from which we're viewing things", though I think in
>>  practice it comes to the same thing.)
> 
> I think we had a bit. Perhaps @external? The controller which doesn't
> have @external == true should be the local one IIRC. And by extension
> presumably in charge.

Yes, and that should work fine per se.  It's just not reflected in the
 phys_port_name string in any way, so legacy userland that relies on
 that won't have that piece of info (but it never did) and probably
 assumes that c0 is local.

-ed

* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-10 16:02       ` Edward Cree
@ 2022-08-10 17:58         ` Jakub Kicinski
  2022-08-10 19:21           ` Edward Cree
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2022-08-10 17:58 UTC (permalink / raw)
  To: Edward Cree
  Cc: ecree, netdev, davem, pabeni, edumazet, corbet, linux-doc,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck

On Wed, 10 Aug 2022 17:02:33 +0100 Edward Cree wrote:
> On 09/08/2022 04:41, Jakub Kicinski wrote:
> >> Maybe a bad word choice.  I'm referring to whichever PF (which likely
> >>  also has an ordinary netdevice) has administrative rights over the NIC /
> >>  internal switch at a firmware level.  Other names I've seen tossed
> >>  around include "primary PF", "admin PF".  
> > 
> > I believe someone (mellanox?) used the term eswitch manager.
> > I'd use "host PF", somehow that makes most sense to me.  
> 
> Not sure about that, I've seen "host" used as antonym of "SoC", so
>  if the device is configured with the SoC as the admin this could
>  confuse people.

In the literal definition of the word "host" it is the entity which
"owns the place".

> I think whatever term we settle on, this document might need to
>  have a 'Definitions' section to make it clear :S

Ack, to perhaps clarify my concern further, I've seen the term
"management PF" refer to a PF which is not a netdev PF, it only
performs management functions. Which I don't believe is what we
are describing here. So a perfect term would describe the privilege
not the function (as the primary function of such a PF should still be
networking).

> >> Yes, that's where I got this terminology from.
> >> "the" PCIe controller here is the one on which the mgmt PF lives.  For
> >>  instance you might have a NIC where you run OVS on a SoC inside the
> >>  chip, that has its own PCIe controller including a PF it uses to drive
> >>  the hardware v-switch (so it can offload OVS rules), in addition to
> >>  the PCIe controller that exposes PFs & VFs to the host you plug it
> >>  into through the physical PCIe socket / edge connector.
> >> In that case this bullet would refer to any additional PFs the SoC has
> >>  besides the management one...  
> > 
> > IMO the model where there's an overall controller for the entire device
> > is also a Mellanox limitation, due to lack of support for nested
> > switches.
> Instead of "the PCIe controller" I should probably say "the local PCIe
>  controller", since that's the wording the devlink-port doc uses.

SG!

> > "TX queue attached to" made me think of a netdev Tx queue with a qdisc
> > rather than just a HW queue. No better ideas tho.  
> 
> Would adding the word "hardware" before "TX queue" help?  Have to
>  admit the netdev-queue interpretation hadn't occurred to me.

It would!

> >> (And it looks like the core uses `c<N>` for my `if<N>` that you were
> >>  so horrified by.  Devlink-port documentation doesn't make it super
> >>  clear whether controller 0 is "the controller that's in charge" or
> >>  "the controller from which we're viewing things", though I think in
> >>  practice it comes to the same thing.)  
> > 
> > I think we had a bit. Perhaps @external? The controller which doesn't
> > have @external == true should be the local one IIRC. And by extension
> > presumably in charge.  
> 
> Yes, and that should work fine per se.  It's just not reflected in the
>  phys_port_name string in any way, so legacy userland that relies on
>  that won't have that piece of info (but it never did) and probably
>  assumes that c0 is local.

Ack, we could check the archive but I think that's indeed the case.
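
For illustration only, a minimal Python sketch of how userland might
split such a phys_port_name. It assumes names of the form
[c<N>]pf<N>[vf<N>|sf<N>], which is not spelled out in this thread, and
parse_phys_port_name is a hypothetical helper, not an existing kernel or
iproute2 API. As noted above, the string carries no "external" flag, so a
name without a c<N> prefix can only be assumed to be local.

```python
import re

# Assumed name layout (not confirmed in this thread):
#   optional "c<N>" controller prefix, then "pf<N>",
#   then an optional "vf<N>" or "sf<N>" suffix.
_PORT_NAME = re.compile(
    r"^(?:c(?P<controller>\d+))?"
    r"pf(?P<pf>\d+)"
    r"(?:(?P<kind>vf|sf)(?P<index>\d+))?$"
)


def parse_phys_port_name(name):
    """Hypothetical helper: split a phys_port_name into its parts.

    Returns None when the name does not follow the assumed layout.
    A controller of None means the name had no c<N> prefix.
    """
    m = _PORT_NAME.match(name)
    if m is None:
        return None
    return {
        "controller": int(m["controller"]) if m["controller"] is not None else None,
        "pf": int(m["pf"]),
        "kind": m["kind"] or "pf",
        "index": int(m["index"]) if m["index"] is not None else None,
    }
```

A caller that wants the legacy behaviour would treat a controller of
None as "local", which is exactly the guess discussed above.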


* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-10 17:58         ` Jakub Kicinski
@ 2022-08-10 19:21           ` Edward Cree
  2022-08-10 22:58             ` Alexander Duyck
  0 siblings, 1 reply; 12+ messages in thread
From: Edward Cree @ 2022-08-10 19:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: ecree, netdev, davem, pabeni, edumazet, corbet, linux-doc,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck

On 10/08/2022 18:58, Jakub Kicinski wrote:
> On Wed, 10 Aug 2022 17:02:33 +0100 Edward Cree wrote:
>> On 09/08/2022 04:41, Jakub Kicinski wrote:
>>> I'd use "host PF", somehow that makes most sense to me.  
>>
>> Not sure about that, I've seen "host" used as antonym of "SoC", so
>>  if the device is configured with the SoC as the admin this could
>>  confuse people.
> 
> In the literal definition of the word "host" it is the entity which
> "owns the place".

Sure, but as an application of that, people talk about e.g. "host"
 and "device" ends of a bus, DMA transfer, etc.  As a result of which
 "host" has come to mean "computer; server; the big rack-mounted box
 you plug cards into".
A connotation which is unfortunate once a single device can live on
 two separate PCIe hierarchies, connected to two computers each with
 its own hostname, and the one which owns the device is the cluster
 of embedded CPUs inside the card, rather than the big metal box.

>> I think whatever term we settle on, this document might need to
>>  have a 'Definitions' section to make it clear :S
> 
> Ack, to perhaps clarify my concern further, I've seen the term
> "management PF" refer to a PF which is not a netdev PF, it only
> performs management functions.

Yeah, I saw that interpretation as soon as you queried it.  I agree
 we probably can't use "management PF".

> So a perfect term would describe the privilege
> not the function (as the primary function of such a PF should still be
> networking).

I'm probably gonna get shot for suggesting this, but how about
 "master PF"?

-ed


* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-10 19:21           ` Edward Cree
@ 2022-08-10 22:58             ` Alexander Duyck
  2022-08-11 16:17               ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Duyck @ 2022-08-10 22:58 UTC (permalink / raw)
  To: Edward Cree
  Cc: Jakub Kicinski, ecree, netdev, davem, pabeni, edumazet, corbet,
	linux-doc, linux-net-drivers, Jacob Keller, Jesse Brandeburg,
	Michael Chan, Andy Gospodarek, Saeed Mahameed, Jiri Pirko,
	Shannon Nelson, Simon Horman

On Wed, Aug 10, 2022 at 12:21 PM Edward Cree <ecree.xilinx@gmail.com> wrote:
>
> On 10/08/2022 18:58, Jakub Kicinski wrote:
> > On Wed, 10 Aug 2022 17:02:33 +0100 Edward Cree wrote:
> >> On 09/08/2022 04:41, Jakub Kicinski wrote:
> >>> I'd use "host PF", somehow that makes most sense to me.
> >>
> >> Not sure about that, I've seen "host" used as antonym of "SoC", so
> >>  if the device is configured with the SoC as the admin this could
> >>  confuse people.
> >
> > In the literal definition of the word "host" it is the entity which
> > "owns the place".
>
> Sure, but as an application of that, people talk about e.g. "host"
>  and "device" ends of a bus, DMA transfer, etc.  As a result of which
>  "host" has come to mean "computer; server; the big rack-mounted box
>  you plug cards into".
> A connotation which is unfortunate once a single device can live on
>  two separate PCIe hierarchies, connected to two computers each with
>  its own hostname, and the one which owns the device is the cluster
>  of embedded CPUs inside the card, rather than the big metal box.

I agree that "host" isn't going to work, as a multi-host-capable device
might end up having only one "host" that can actually handle the
configuration of the switch for the entire device. So then you have
different types of "host" interfaces.

> >> I think whatever term we settle on, this document might need to
> >>  have a 'Definitions' section to make it clear :S
> >
> > Ack, to perhaps clarify my concern further, I've seen the term
> > "management PF" refer to a PF which is not a netdev PF, it only
> > performs management functions.
>
> Yeah, I saw that interpretation as soon as you queried it.  I agree
>  we probably can't use "management PF".

One thing we may want to think about is looking more at "interfaces"
rather than "devices" or "functions". Essentially a PF is a "Host
Network Interface", a VF or sub-function would be a "Virtual Network
Interface" (VNI), and an external port would be an "External/Uplink
Interface". Then we have a set of "interfaces", which would allow us to
get away from confusing networking with PCI bus topology, where we also
have functions that are present on the device that may not expose
networking interfaces and provide control only. In addition, something
like a VNI is more extensible, so if we start getting into some other
new virtualization option in the future we are not stuck having to go
through and add yet more documentation to describe it all.

> > So a perfect term would describe the privilege
> > not the function (as the primary function of such a PF should still be
> > networking).
>
> I'm probably gonna get shot for suggesting this, but how about
>  "master PF"?

Usually with "master" you are talking about something like a bus. It
also occurs to me that the use of PF is assuming a single PCIe
function dedicated to performing this role. With sub-functions
floating around I could easily see a PF getting partitioned to
dedicate queues to handling switchdev operations while still allowing
other networking to pass over the original network interface. Then the
question is which one is the PF and which one is the sub-function.

I'd be more of a fan of sticking with the "interface" naming and
describing what the interface would be used for. The first thought
that comes to mind is to just refer to the configuration interface as
a "NIC", since that would be the "Network Interface Controller";
however, I could see how that could easily be confusing, since that is
the PCI description for the device. Maybe something like a "Controller
Interface", "CI", would make sense, since it seems like OVS uses
"Controller" to describe the instance that programs the flows, so we
could use similar terminology.


* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
  2022-08-10 22:58             ` Alexander Duyck
@ 2022-08-11 16:17               ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2022-08-11 16:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Edward Cree, ecree, netdev, davem, pabeni, edumazet, corbet,
	linux-doc, linux-net-drivers, Jacob Keller, Jesse Brandeburg,
	Michael Chan, Andy Gospodarek, Saeed Mahameed, Jiri Pirko,
	Shannon Nelson, Simon Horman

On Wed, 10 Aug 2022 15:58:54 -0700 Alexander Duyck wrote:
> > Sure, but as an application of that, people talk about e.g. "host"
> >  and "device" ends of a bus, DMA transfer, etc.  As a result of which
> >  "host" has come to mean "computer; server; the big rack-mounted box
> >  you plug cards into".
> > A connotation which is unfortunate once a single device can live on
> >  two separate PCIe hierarchies, connected to two computers each with
> >  its own hostname, and the one which owns the device is the cluster
> >  of embedded CPUs inside the card, rather than the big metal box.  
> 
> I agree that "host" isn't going to work as a multi-host capable device
> might end up having only one "host" that can actually handle the
> configuration of the switch for the entire device. So then you have
> different types of "host" interfaces.

Thank $deity I haven't had to think about multi-host NPU/DPU/IPUs
for a couple of years now, but I think trying to elect a leader in
charge across the hosts is not a good idea there. Much easier to proxy
all configuration thru FW, as much as I hate that (since FW is usually
closed).

That said, choosing the term is about intuition, not proofs, so "host"
won't fly.

Thread overview: 12+ messages
2022-08-05 16:58 [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors ecree
2022-08-05 19:15 ` Randy Dunlap
2022-08-08 20:48   ` Edward Cree
2022-08-06  1:43 ` Jakub Kicinski
2022-08-08 16:50   ` Keller, Jacob E
2022-08-08 20:44   ` Edward Cree
2022-08-09  3:41     ` Jakub Kicinski
2022-08-10 16:02       ` Edward Cree
2022-08-10 17:58         ` Jakub Kicinski
2022-08-10 19:21           ` Edward Cree
2022-08-10 22:58             ` Alexander Duyck
2022-08-11 16:17               ` Jakub Kicinski
