All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Parav Pandit <parav@mellanox.com>, Or Gerlitz <gerlitz.or@gmail.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	michal.lkml@markovi.net, davem@davemloft.net,
	gregkh@linuxfoundation.org, jiri@mellanox.com
Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
Date: Fri, 1 Mar 2019 12:03:58 -0800	[thread overview]
Message-ID: <20190301120358.7970f0ad@cakuba.netronome.com> (raw)
In-Reply-To: <1551418672-12822-1-git-send-email-parav@mellanox.com>

On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> Use case:
> ---------
> A user wants to create/delete hardware linked sub devices without
> using SR-IOV.
> These devices for a pci device can be netdev (optional rdma device)
> or other devices. Such sub devices share some of the PCI device
> resources and also have their own dedicated resources.
> 
> Few examples are:
> 1. netdev having its own txq(s), rq(s) and/or hw offload parameters.
> 2. netdev with switchdev mode using netdev representor
> 3. rdma device with IB link layer and IPoIB netdev
> 4. rdma/RoCE device and a netdev
> 5. rdma device with multiple ports
> 
> Requirements for above use cases:
> --------------------------------
> 1. We need a generic user interface & core APIs to create sub devices
> from a parent pci device but should be generic enough for other parent
> devices
> 2. Interface should be vendor agnostic
> 3. User should be able to set device params at creation time
> 4. In future if needed, tool should be able to create passthrough
> device to map to a virtual machine

Like a mediated device?

https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-Devices-Better-Userland-IO.pdf

Other than pass-through it is entirely unclear to me why you'd need 
a bus.  (Or should I say VM pass through or DPDK?)  Could you clarify
why the need for a bus?

My thinking is that we should allow spawning subports in devlink and 
if user specifies "passthrough" the device spawned would be an mdev.

> 5. A device can have multiple ports

What does this mean, in practice?  You want to spawn a subdev which can
access both ports?  That'd be for RDMA use cases, more than Ethernet,
right?  (Just clarifying :))

> 6. An orchestration software wants to know how many such sub devices
> can be created from a parent device so that it can manage them in global
> cluster resources.
> 
> So how is it done?
> ------------------
> (a) user in control
> To address above requirements, a generic tool iproute2/devlink is
> extended for sub device's life cycle.
> However a devlink tool and its kernel counter part is not sufficient
> to create protocol agnostic devices on a existing PCI bus.

"Protocol agnostic"?...  What does that mean?

> (b) subdev bus
> A given bus defines well defined addressing scheme. Creating sub devices
> on existing PCI bus with a different naming scheme is just weird.
> So, creating well named devices on appropriate bus is desired.

What's that address scheme you're referring to, you seem to assign IDs
in sequence?

> Hence a new 'subdev' bus is created.
> User adds/removes new sub devices subdev on this bus via a devlink tool.
> devlink tool instructs hardware driver to create/remove/configure
> such devices. Hardware vendor driver places devices on the bus.
> Another or same vendor driver matches based on vendor-id, device-id
> scheme and run through classic device driver model.
> 
> Given that, these are user created devices for a given hardware and in
> absence of a central entity like PCISIG to assign vendor and device ids,
> A unique vendor and device id are maintained as enum in
> include/linux/subdev_ids.h.

Why do we need IDs?  The sysfs hierarchy isn't sufficient?  Do we need
a driver to match on those again?  Is it going to be a different driver?

> subdev bus device names follow default device naming scheme of Linux
> kernel. It is done as 'subdev<instance_id>' such as, subdev0, subdev3.
> 
> subdev device inherits its parent's DMA parameters.
> subdev will follow rich power management infrastructure of core kernel/
> So that every vendor driver doesn't have to iterate over its child
> devices, invent a locking and device anchoring scheme.
> 
> Patchset summary:
> -----------------
> Patch-1, 2 introduces a subdev bus and interface for subdev life cycle.
> Patch-3 extends modpost tool for module device id table.
> Patch-4,5,6 implements a devlink vendor driver to add/remove devices.
> Patch-7 mlx5 driver implements subdev devices and places them on subdev
> bus. 
> Patch-8 match against the subdev for mlx5 vendor, device id and creates
> fake netdevice.
> 
> All patches are only a reference implementation to see RFC in works
> at devlink, sysfs and device model level. Once RFC looks good, more
> solid upstreamable version of the implementation will be done.
> All patches are functional except the last two patches, which just
> create fake subdev devices and fake netdevice.
> 
> System example view:
> --------------------
> 
> $ devlink dev show
> pci/0000:05:00.0
> 
> $ devlink dev add pci/0000:05:00.0

That does not look great.  

Also you have to return the id of the spawned device, otherwise this 
is very racy.

> $ devlink dev show
> pci/0000:05:00.0
> subdev/subdev0

Please don't spawn devlink instances.  Devlink instance is supposed to
represent an ASIC.  If we start spawning them willy nilly for whatever
software construct we want to model the clarity of the ontology will
suffer a lot.

Please see the discussion on my recent patchset.  I think Jiri CCed you.

> sysfs view with subdev:
> 
> $ ls -l /sys/bus/pci/devices/0000:05:00.0
> [..]
> drwxr-xr-x 3 root root        0 Feb 13 15:57 infiniband
> -rw-r--r-- 1 root root     4096 Feb 13 15:57 msi_bus
> drwxr-xr-x 3 root root        0 Feb 13 15:57 net
> drwxr-xr-x 2 root root        0 Feb 13 15:57 power
> drwxr-xr-x 3 root root        0 Feb 13 15:57 ptp
> drwxr-xr-x 4 root root        0 Feb 13 15:57 subdev0
> 
> $ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0
> lrwxrwxrwx 1 root root    0 Feb 13 15:58 driver -> ../../../../../bus/subdev/drivers/mlx5_core
> drwxr-xr-x 3 root root    0 Feb 13 15:58 net
> drwxr-xr-x 2 root root    0 Feb 13 15:58 power
> lrwxrwxrwx 1 root root    0 Feb 13 15:58 subsystem -> ../../../../../bus/subdev
> -rw-r--r-- 1 root root 4096 Feb 13 15:58 uevent
> 
> $ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0/net/
> drwxr-xr-x 5 root root 0 Feb 13 15:58 eth0
> 
> Software view:
> -------------
> Some of you if you prefer to see in picture, below diagram tries to
> show software modules in bus/device hierarchy.
> 
> devlink user (iproute2/devlink)
> ------------------------------
>          |
>          |
>   +----------------+
>   | devlink module |
>   |         doit() |     +------------------+
>   |            |   |     |   vendor driver  |
>   +------------|---+     |   (mlx5)         |
>                ----------+-> subdev_ops()   |
>                          +|-----------------+
>                           |
>                 +---------|--+  +-----------+  +------------------+
>                 | subdev bus |  | core      |  | subdev device    |
>                 | driver     |  | kernel    |  | drivers          |
>                 | (add/del)  |  | dev model |  | (netdev, rdma)   |
>                 |       ----------------------> probe/remove()    |
>                 +------------+  +-----------+  +------------------+
> 
> Alternatives considered:
> ------------------------
> Will discuss separately if needed to keep this RFC short.

Please do discuss.

The things key thing for me on the netdev side is what is the
forwarding model to this new entity.  Is this basically VMDQ?  
Should we just go ahead and mandate "switchdev mode" here?

Thanks for working on a common architecture and suffering through
people's reviews rather than adding a debugfs interface that does 
this like a different vendor did :)

  parent reply	other threads:[~2019-03-01 20:04 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-01  5:37 [RFC net-next 0/8] Introducing subdev bus and devlink extension Parav Pandit
2019-03-01  5:37 ` [RFC net-next 1/8] subdev: Introducing subdev bus Parav Pandit
2019-03-01  7:17   ` Greg KH
2019-03-01 16:35     ` Parav Pandit
2019-03-01 17:00       ` Greg KH
2019-03-26 11:48     ` Lorenzo Pieralisi
2019-03-01  5:37 ` [RFC net-next 2/8] subdev: Introduce pm callbacks Parav Pandit
2019-03-01  5:37 ` [RFC net-next 3/8] modpost: Add support for subdev device id table Parav Pandit
2019-03-01  5:37 ` [RFC net-next 4/8] devlink: Introduce and use devlink_init/cleanup() in alloc/free Parav Pandit
2019-03-01  5:37 ` [RFC net-next 5/8] devlink: Add variant of devlink_register/unregister Parav Pandit
2019-03-01  5:37 ` [RFC net-next 6/8] devlink: Add support for devlink subdev lifecycle Parav Pandit
2019-03-01  5:37 ` [RFC net-next 7/8] net/mlx5: Add devlink subdev life cycle command support Parav Pandit
2019-03-01  7:18   ` Greg KH
2019-03-01 16:04     ` Parav Pandit
2019-03-01  5:37 ` [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices Parav Pandit
2019-03-01  7:21   ` Greg KH
2019-03-01 17:21     ` Parav Pandit
2019-03-05  7:13       ` Greg KH
2019-03-05 17:57         ` Parav Pandit
2019-03-05 19:27           ` Greg KH
2019-03-05 21:37             ` Parav Pandit
2019-03-01 22:12   ` Saeed Mahameed
2019-03-04 16:45     ` Parav Pandit
2019-03-01 20:03 ` Jakub Kicinski [this message]
2019-03-04  4:41   ` [RFC net-next 0/8] Introducing subdev bus and devlink extension Parav Pandit
2019-03-05  1:35     ` Jakub Kicinski
2019-03-05 19:46       ` Parav Pandit
2019-03-05 22:39         ` Kirti Wankhede
2019-03-05 23:17           ` Parav Pandit
2019-03-05 23:44             ` Parav Pandit
2019-03-06  0:44               ` Parav Pandit
2019-03-06  3:51                 ` Kirti Wankhede
2019-03-06  5:42                   ` Parav Pandit
2019-03-07 19:04                     ` Kirti Wankhede
2019-03-07 20:27                       ` Parav Pandit
2019-03-07 20:53                         ` Kirti Wankhede
2019-03-07 21:02                           ` Parav Pandit
2019-03-07 21:07                             ` Kirti Wankhede
2019-03-07 21:21                               ` Parav Pandit
2019-03-07 22:01                                 ` Kirti Wankhede
2019-03-07 22:31                                   ` Parav Pandit
2019-03-08 12:19                                     ` Kirti Wankhede
2019-03-08 17:09                                       ` Parav Pandit
2019-03-05  1:45     ` Jakub Kicinski
2019-03-05 16:52       ` Parav Pandit
2021-05-31 10:36         ` moyufeng
2021-06-01  5:37           ` Jakub Kicinski
2021-06-01  7:33             ` Yunsheng Lin
2021-06-01 21:34               ` Jakub Kicinski
2021-06-02  2:24                 ` Yunsheng Lin
2021-06-02 16:34                   ` Jakub Kicinski
2021-06-03  3:46                     ` Yunsheng Lin
2021-06-03 17:53                       ` Jakub Kicinski
2021-06-04  1:18                         ` Yunsheng Lin
2021-06-04 18:41                           ` Jakub Kicinski
2021-06-07  1:36                             ` Yunsheng Lin
2021-06-07 19:46                               ` Jakub Kicinski
2021-06-08 12:10                                 ` Yunsheng Lin
2021-06-08 17:29                                   ` Jakub Kicinski
2021-06-09  9:16                                     ` Yunsheng Lin
2021-06-09  9:38                                       ` Parav Pandit
2021-06-09 11:05                                         ` Yunsheng Lin
2021-06-09 11:59                                           ` Parav Pandit
2021-06-09 12:30                                             ` Yunsheng Lin
2021-06-09 13:45                                               ` Parav Pandit
2021-06-10  7:04                                                 ` Yunsheng Lin
2021-06-10  7:17                                                   ` Parav Pandit
2021-06-09 16:40                                       ` Jakub Kicinski
2021-06-10  6:52                                         ` Yunsheng Lin
2021-06-09  9:52                                   ` Parav Pandit
2021-06-09 11:16                                     ` Yunsheng Lin
2021-06-09 12:00                                       ` Parav Pandit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190301120358.7970f0ad@cakuba.netronome.com \
    --to=jakub.kicinski@netronome.com \
    --cc=davem@davemloft.net \
    --cc=gerlitz.or@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jiri@mellanox.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=michal.lkml@markovi.net \
    --cc=netdev@vger.kernel.org \
    --cc=parav@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.