linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
@ 2023-01-30 19:11 Viacheslav A.Dubeyko
  2023-01-31 17:41 ` Jonathan Cameron
  0 siblings, 1 reply; 13+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-01-30 19:11 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, linux-cxl, Dan Williams, Adam Manzanares,
	Jonathan Cameron, Cong Wang, Viacheslav Dubeyko

Hello,

I would like to suggest a Fabric Manager (FM) architecture discussion. As far as I can see,
an FM architecture requires: (1) an FM configuration tool, (2) an FM daemon, (3) QEMU emulation
of CXL hardware features. The FM daemon receives requests from the configuration tool and
executes commands by interacting with a kernel-space subsystem and the CXL switch
(which can be emulated by QEMU). So, the key questions for discussion:
(1) How to distribute functionality between user-space and kernel-space?
(2) Which functionality does kernel-space need to provide to implement FM features?
      Which kernel-space functionality still needs to be implemented?
(3) Do we need MCTP (Management Component Transport Protocol), or can some other
      protocol be used for interaction between the configuration tool, FM daemon, and
      CXL switch?
(4) What architecture does the FM implementation require?
(5) Does it make sense to use Rust as the implementation language?

CXL Fabric Manager (FM) is the application logic responsible for system composition and
allocation of resources. The FM can be embedded in the firmware of a device such as
a CXL switch, reside on a host, or could be run on a Baseboard Management Controller (BMC).
CXL Specification 3.0 defines Fabric Management as: "CXL devices can be configured statically
or dynamically via a Fabric Manager (FM), an external logical process that queries and configures
the system’s operational state using the FM commands defined in this specification. The FM is
defined as the logical process that decides when reconfiguration is necessary and initiates
the commands to perform configurations. It can take any form, including, but not limited to,
software running on a host machine, embedded software running on a BMC, embedded firmware
running on another CXL device or CXL switch, or a state machine running within the CXL device
itself.” CXL devices are configured by the FM through the Fabric Manager Application Programming
Interface (FM API) command sets, issued through a CCI (Component Command Interface). A CCI is
exposed through a device’s Mailbox registers or through an MCTP-capable (Management
Component Transport Protocol) interface.
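
To make the two CCI paths above concrete, here is a minimal, purely
illustrative C sketch of how FM software might route a single FM API request
over either a mailbox-backed or an MCTP-backed CCI. Every name in it
(fm_api_submit, cci_transport, and so on) is invented for this example and is
not taken from the specification or from any existing code:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* All identifiers below are hypothetical, not spec-defined and not kernel API. */
enum cci_transport {
        CCI_TRANSPORT_MAILBOX,  /* CCI reached through the device's Mailbox registers */
        CCI_TRANSPORT_MCTP,     /* CCI reached through an MCTP-capable interface */
};

struct fm_api_request {
        uint16_t opcode;        /* placeholder; real opcodes are defined by CXL 3.0 */
        const void *payload;
        size_t payload_len;
};

/* Route one FM API request over whichever transport the target CCI provides. */
static int fm_api_submit(enum cci_transport t, const struct fm_api_request *req)
{
        switch (t) {
        case CCI_TRANSPORT_MAILBOX:
                /* would go through a kernel-provided mailbox interface */
                printf("mailbox: opcode 0x%04x, %zu payload bytes\n",
                       req->opcode, req->payload_len);
                return 0;
        case CCI_TRANSPORT_MCTP:
                /* would go through an MCTP socket, no CXL-specific kernel code */
                printf("mctp: opcode 0x%04x, %zu payload bytes\n",
                       req->opcode, req->payload_len);
                return 0;
        }
        return -1;
}

int main(void)
{
        struct fm_api_request req = { .opcode = 0, .payload = NULL, .payload_len = 0 };

        return fm_api_submit(CCI_TRANSPORT_MAILBOX, &req);
}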

FM API Commands (defined by CXL Specification 3.0):
(1) Physical switch (Identify Switch Device, Get Physical Port State, Physical Port Control,
      Send PPB (PCI-to-PCI Bridge) CXL.io Configuration Request).
(2) Virtual Switch (Get Virtual CXL Switch Info, Bind vPPB (Virtual PCI-to-PCI Bridge),
      Unbind vPPB, Generate AER (Advanced Error Reporting) Event).
(3) MLD Port (Tunnel Management Command, Send LD (Logical Device) or
     FMLD (Fabric Manager-owned Logical Device) CXL.io Configuration Request,
     Send LD CXL.io Memory Request).
(4) MLD Components (Get LD (Logical Device) Info, Get LD Allocations, Set LD Allocations,
     Get QoS Control, Set QoS Control, Get QoS Status, Get QoS Allocated Bandwidth,
     Set QoS Allocated Bandwidth, Get QoS Bandwidth Limit, Set QoS Bandwidth Limit).
(5) Multi-Headed Devices (Get Multi-Headed Info).
(6) DCD (Dynamic Capacity Device) Management (Get DCD Info, Get Host Dynamic
     Capacity Region Configuration, Set Dynamic Capacity Region Configuration, Get DCD
     Extent Lists, Initiate Dynamic Capacity Add, Initiate Dynamic Capacity Release).
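
As a rough illustration only, a management tool could group these command
sets along the following lines (the enumerator values are arbitrary; the real
opcode encodings are defined by the CXL 3.0 specification):

/* Purely illustrative grouping of the command sets listed above. */
enum fm_api_command_set {
        FM_CMDSET_PHYSICAL_SWITCH,
        FM_CMDSET_VIRTUAL_SWITCH,
        FM_CMDSET_MLD_PORT,
        FM_CMDSET_MLD_COMPONENTS,
        FM_CMDSET_MULTI_HEADED_DEVICE,
        FM_CMDSET_DCD_MANAGEMENT,
};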

After the initial configuration is complete and a CCI on the switch is operational, an FM can
send Management Commands to the switch. An FM may perform the following dynamic
management actions on a CXL switch: (1) Query switch information and configuration details,
(2) Bind or Unbind ports, (3) Register to receive and handle event notifications from the switch
(e.g., hot plug, surprise removal, and failures). A switch with MLD (Multi-Logical Device)
requires an FM to perform the following management activities: (1) MLD discovery,
(2) LD (Logical Device) binding/unbinding, (3) Management Command Tunneling. The FM can
connect to an MLD (Multi-Logical Device) over a direct connection or by tunneling its
management commands through the CCI of the CXL switch to which the device is connected.
The FM can perform the following operations: (1) Memory allocation and QoS Telemetry
management, (2) Security (e.g., LD erasure after unbinding), (3) Error handling.
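
To illustrate the tunneling point: a tunneled request conceptually wraps an
inner FM API command addressed to a device sitting behind the switch. The
struct below is only a sketch of that nesting; the field names and sizes are
assumptions, not the payload layout defined in CXL 3.0:

#include <stdint.h>

struct tunneled_command {
        uint8_t  target_port;       /* switch port the MLD sits behind */
        uint16_t inner_opcode;      /* opcode of the command being tunneled */
        uint16_t inner_length;      /* length of the embedded command payload */
        uint8_t  inner_payload[];   /* the embedded command itself */
};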

The FM configuration tool requires the following commands:

Discover - discover available agents
Subcommands:
    - fm_cli discover fm - discover FM instances
    - fm_cli discover cxl_devices - discover CXL devices
    - fm_cli discover logical_devices - discover logical devices

FM - manage Fabric Manager
Subcommands:
    - fm_cli fm get_info - get FM status/info
    - fm_cli fm start - start FM instance
    - fm_cli fm restart - restart FM instance
    - fm_cli fm stop - stop FM instance
    - fm_cli fm get_config - get FM configuration
    - fm_cli fm set_config - set FM configuration
    - fm_cli fm get_events - get event records

Switch - manage CXL switch
Subcommands:
    - fm_cli switch get_info - get CXL switch info/status
    - fm_cli switch get_config - get switch configuration
    - fm_cli switch set_config - set switch configuration

Logical Device - manage logical devices
Subcommands:
    - fm_cli multi_headed_device info - retrieves the number of heads, number of
           supported LDs, and Head-to-LD mapping of a Multi-Headed device
    - fm_cli logical_device bind - bind logical device
    - fm_cli logical_device unbind - unbind logical device
    - fm_cli logical_device connect - connect Multi Logical Device to CXL switch
    - fm_cli logical_device disconnect - disconnect Multi Logical Device from CXL switch
    - fm_cli logical_device get_allocation - Get LD Allocations (retrieves the memory
           allocations of the MLD)
    - fm_cli logical_device set_allocation - Set LD Allocations (sets the memory allocation
           for each LD)
    - fm_cli logical_device get_qos_control - Get QoS Control (retrieves the MLD’s QoS
           control parameters)
    - fm_cli logical_device set_qos_control - Set QoS Control (sets the MLD’s QoS control
           parameters)
    - fm_cli logical_device get_qos_status - Get QoS Status (retrieves the MLD’s QoS Status)
    - fm_cli logical_device get_qos_allocated_bandwidth - Get QoS Allocated Bandwidth
          (retrieves the MLD’s QoS allocated bandwidth on a per-LD basis)
    - fm_cli logical_device set_qos_allocated_bandwidth - Set QoS Allocated Bandwidth
          (sets the MLD’s QoS allocated bandwidth on a per-LD basis)
    - fm_cli logical_device get_qos_bandwidth_limit - Get QoS Bandwidth Limit (retrieves the
          MLD’s QoS bandwidth limit on a per-LD basis)
    - fm_cli logical_device set_qos_bandwidth_limit - Set QoS Bandwidth Limit (sets the
          MLD’s QoS bandwidth limit on a per-LD basis)
    - fm_cli logical_device erase - secure erase after unbinding

PCI-to-PCI Bridge - manage PPB (PCI-to-PCI Bridge)
Subcommands:
    - fm_cli ppb config - Send PPB (PCI-to-PCI Bridge) CXL.io Configuration Request
    - fm_cli ppb bind - Bind vPPB (Virtual PCI-to-PCI Bridge inside a CXL switch that is
           host-owned)
    - fm_cli ppb unbind - Unbind vPPB (unbinds the physical port or LD from the virtual
           hierarchy PPB)

Physical Port - manage physical ports
Subcommands:
    - fm_cli physical_port get_info - get state of physical port
    - fm_cli physical_port control - control unbound ports and MLD ports, including issuing
           resets and controlling sidebands
    - fm_cli physical_port bind - bind physical port to vPPB (Virtual PCI-to-PCI Bridge)
    - fm_cli physical_port unbind - unbind physical port from vPPB (Virtual PCI-to-PCI Bridge)

MLD (Multi-Logical Device) Port - manage Multi-Logical Device ports
Subcommands:
    - fm_cli mld_port tunnel - Tunnel Management Command (tunnels the provided command
           to LD FFFFh of the MLD on the specified port)
    - fm_cli mld_port send_config - Send LD (Logical Device) or FMLD (Fabric
           Manager-owned Logical Device) CXL.io Configuration Request
    - fm_cli mld_port send_memory_request - Send LD CXL.io Memory Request

DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
Subcommands:
    - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
         total Dynamic Capacity of the device, and supported region configurations)
    - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
         (retrieves the Dynamic Capacity configuration for a specified host)
    - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
         (sets the configuration of a DC Region)
    - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
         Extent List for a specified host)
    - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
         Dynamic Capacity to the specified region on a host)
    - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
         Dynamic Capacity from a host)

The FM daemon receives requests from the configuration tool and executes commands by
interacting with kernel-space subsystems. The responsibilities of the FM daemon could
include the following (a minimal sketch of the request loop follows the list):
    - Execute configuration tool commands
    - Manage hot-add and hot-removal of devices
    - Manage surprise removal of devices
    - Receive and handle event notifications from the CXL switch
    - Logging events
    - Memory allocation and QoS Telemetry management
    - Error/Failure handling
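
A minimal sketch of the daemon's request loop, assuming the configuration
tool talks to the daemon over a Unix domain socket (the socket path and the
fm_dispatch() helper are hypothetical), could look like this:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define FM_SOCK_PATH "/run/fm_daemon.sock"      /* assumed path, nothing standard */

static void fm_dispatch(const char *req, size_t len)
{
        /* Translate the tool's request into FM API commands and issue them
         * via the kernel (mailbox) or MCTP transport. */
        printf("request: %.*s\n", (int)len, req);
}

int main(void)
{
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        char buf[4096];

        strncpy(addr.sun_path, FM_SOCK_PATH, sizeof(addr.sun_path) - 1);
        unlink(FM_SOCK_PATH);
        if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) || listen(srv, 8))
                return 1;

        for (;;) {
                int cli = accept(srv, NULL, NULL);
                ssize_t n;

                if (cli < 0)
                        continue;
                while ((n = read(cli, buf, sizeof(buf))) > 0)
                        fm_dispatch(buf, (size_t)n);
                close(cli);
        }
}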

Thanks,
Slava.




* Re: [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-01-30 19:11 [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture Viacheslav A.Dubeyko
@ 2023-01-31 17:41 ` Jonathan Cameron
  2023-02-01 20:04   ` [External] " Viacheslav A.Dubeyko
  0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2023-01-31 17:41 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: lsf-pc, linux-mm, linux-cxl, Dan Williams, Adam Manzanares,
	Cong Wang, Viacheslav Dubeyko

On Mon, 30 Jan 2023 11:11:23 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:

> Hello,

Hi Slava,

I'll throw some opinions at this :)

> 
> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
> of CXL hardware features. FM daemon receives requests from configuration tool and
> executes commands by means of interaction with kernel-space subsystem and CXL switch
> (that can be emulated by QEMU). So, the key questions for discussion:

Worth describing the operating modes to be supported: you kind of cover this later
but I think pulling it out makes it clearer that we want one bit of software to
do several different things.

1) FM separate from hosts and talked to by higher level orchestration software
   but using a Switch CCI or MHD mailbox (over PCI)
   This one is fairly easy because any security / shooting self in foot problems
   are an issue for higher level software. 
2) FM on host.  Probably mostly going to be relevant for debug but may use
   the same mailbox as is being used by the existing CXL drivers (for Multi
   Head Device it might be the end point mailbox, for Multi Logical Device
   behind a switch it might be the switch mailbox).
3) All out of band (MCTP or similar - want some shared code, but no
   need for anything in kernel as far as I can tell).
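
(Purely as illustration, a shared code base might capture those three modes
as a configuration switch; the names below are made up:)

enum fm_deployment_mode {
        FM_MODE_ORCHESTRATED,   /* 1) off-host FM driven via Switch CCI / MHD mailbox over PCI */
        FM_MODE_ON_HOST,        /* 2) FM runs on a host, mostly for debug */
        FM_MODE_OUT_OF_BAND,    /* 3) fully out of band, e.g. MCTP from a BMC */
};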


> (1) How to distribute functionality between user-space and kernel-space?

Kernel for transport if mailbox based (switch or MHD).
Possibly help in kernel with the host to Multiheaded device FM LD tunneling
and host to switch to Multi Logical Device - Logical Device tunneling
but that could also be left to userspace.

If MCTP, use the existing MCTP framework, which is independent of the underlying transport.
I posted a PoC for how this might work a while ago (a hack on top of MCTP-I2C
and some emulation) in the cover letter of the emulation PoC:
https://lore.kernel.org/linux-cxl/20220520170128.4436-1-Jonathan.Cameron@huawei.com/
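
For reference, a minimal userspace sketch of pushing one FM-API request over
the kernel's AF_MCTP socket support (Linux 5.15+) might look like the
following. The destination EID and the MCTP message type value are
assumptions that need checking against the DMTF binding for the CXL FM API:

#include <stdio.h>
#include <sys/socket.h>
#include <linux/mctp.h>

#ifndef AF_MCTP
#define AF_MCTP 45              /* needs reasonably recent headers */
#endif

int main(void)
{
        struct sockaddr_mctp addr = {
                .smctp_family  = AF_MCTP,
                .smctp_network = MCTP_NET_ANY,
                .smctp_addr    = { .s_addr = 8 },       /* example EID of the target CCI */
                .smctp_type    = 0x07,                  /* assumed message type for CXL FM API */
                .smctp_tag     = MCTP_TAG_OWNER,
        };
        unsigned char req[16] = { 0 };                  /* FM API command, format per CXL 3.0 */
        int sd = socket(AF_MCTP, SOCK_DGRAM, 0);

        if (sd < 0)
                return 1;
        if (sendto(sd, req, sizeof(req), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) < 0)
                perror("sendto");
        /* A real tool would then recvfrom() the response and decode it. */
        return 0;
}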

I think everything else belongs in userspace. I believe there are redfish APIs
etc that would then be used to query and drive the userspace program from an
orchestrator or similar level software.

> (2) Which functionality kernel-space needs to provide for implementation FM features?
>       Which kernel-space functionality do we need to implement yet?

Very little needed if we just expose the transport via PCI mailboxes.
There is a possible concern that FM-API commands are frequently
destructive and currently we don't let userspace poke destructive
commands. That may just need a specific opt in to say we know we
can shoot ourselves in the foot.
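
As a point of comparison, the existing type 3 memdev ioctl already gives
userspace a way to list the mailbox commands a device accepts; a switch CCI
interface could plausibly mirror it. A sketch (check
include/uapi/linux/cxl_mem.h on your kernel for the exact semantics):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/cxl_mem.h>

int main(void)
{
        struct cxl_mem_query_commands *q = calloc(1, sizeof(*q));
        int fd = open("/dev/cxl/mem0", O_RDWR);

        if (fd < 0 || !q)
                return 1;
        /* With n_commands == 0 the kernel reports how many commands exist;
         * a second call with an array sized to match returns their details. */
        if (ioctl(fd, CXL_MEM_QUERY_COMMANDS, q) == 0)
                printf("device reports %u mailbox commands\n", q->n_commands);
        free(q);
        return 0;
}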

> (3) Do we need MCTP (Management Component Transport Protocol) or some other
>       protocol can be used for interaction between configuration tool, FM daemon, and
>       CXL switch?

Yes MCTP is needed.
I don't think we want the actual management code to be different
depending on transport / protocol.  However we might layer it so that there
is an interface program that sits between the management library / program and
the FM-API transport.

Note I was struggling to find a suitable MCTP interface to emulate - so would
welcome suggestions on that.  I hacked the above PoC using an aspeed i2c
controller that supported the right magic combination of features needed
for MCTP over I2C but it doesn't have ACPI support which rather limits
usage (and I doubt anyone will be keen on adding ACPI support just to
test CXL related code :)  If anyone knows of a suitable MCTP host we
could use for this that would be great (MCTP over PCI VDM might be nice for
example)

> (4) What architecture FM implementation requires?
> (5) Does it make sense to use Rust as implementation language?

Take your pick ;) First person to write a lot of code gets to pick the language.

> 
> CXL Fabric Manager (FM) is the application logic responsible for system composition and
> allocation of resources. The FM can be embedded in the firmware of a device such as
> a CXL switch, reside on a host, or could be run on a Baseboard Management Controller (BMC).
> CXL Specification 3.0 defines Fabric Management as: "CXL devices can be configured statically
> or dynamically via a Fabric Manager (FM), an external logical process that queries and configures
> the system’s operational state using the FM commands defined in this specification. The FM is
> defined as the logical process that decides when reconfiguration is necessary and initiates
> the commands to perform configurations. It can take any form, including, but not limited to,
> software running on a host machine, embedded software running on a BMC, embedded firmware
> running on another CXL device or CXL switch, or a state machine running within the CXL device
> itself.”. CXL devices are configured by FM through the Fabric Manager Application Programming
> Interface (FM API) command sets through a CCI (Component Command Interface). A CCI is
> exposed through a device’s Mailbox registers or through an MCTP-capable (Management
> Component Transport Protocol) interface.
> 
> FM API Commands (defined by CXL Specification 3.0):
> (1) Physical switch (Identify Switch Device, Get Physical Port State, Physical Port Control,
>       Send PPB (PCI-to-PCI Bridge) CXL.io Configuration Request).
> (2) Virtual Switch (Get Virtual CXL Switch Info, Bind vPPB (Virtual PCI-to-PCI Bridge),
>       Unbind vPPB, Generate AER (Advanced Error Reporting Event).
> (3) MLD Port (Tunnel Management Command, Send LD (Logical Device) or
>      FMLD (Fabric Manager-owned Logical Device) CXL.io Configuration Request,
>      Send LD CXL.io Memory Request).
> (4) MLD Components (Get LD (Logical Device) Info, Get LD Allocations, Set LD Allocations,
>      Get QoS Control, Set QoS Control, Get QoS Status, Get QoS Allocated Bandwidth,
>      Set QoS Allocated Bandwidth, Get QoS Bandwidth Limit, Set QoS Bandwidth Limit).
> (5) Multi-Headed Devices (Get Multi-Headed Info).
> (6) DCD (Dynamic Capacity Device) Management (Get DCD Info, Get Host Dynamic
>      Capacity Region Configuration, Set Dynamic Capacity Region Configuration, Get DCD
>      Extent Lists, Initiate Dynamic Capacity Add, Initiate Dynamic Capacity Release).
> 
> After the initial configuration is complete and a CCI on the switch is operational, an FM can
> send Management Commands to the switch. An FM may perform the following dynamic
> management actions on a CXL switch: (1) Query switch information and configuration details,
> (2) Bind or Unbind ports, (3) Register to receive and handle event notifications from the switch
> (e.g., hot plug, surprise removal, and failures). A switch with MLD (Multi-Logical Device)
> requires an FM to perform the following management activities: (1) MLD discovery,
> (2) LD (Logical Device) binding/unbinding, (3) Management Command Tunneling. The FM can
> connect to an MLD (Multi-Logical Device) over a direct connection or by tunneling its
> management commands through the CCI of the CXL switch to which the device is connected.
> The FM can perform the following operations: (1) Memory allocation and QoS Telemetry
> management, (2) Security (e.g., LD erasure after unbinding), (3) Error handling.
> 
> FM configuration tool requires such commands:

A command line tool is fine, but likely the 'real' FM configuration interface will be via
a protocol (e.g. Redfish).
https://www.dmtf.org/standards/redfish
There is a WIP for CXL, though I'm not sure of the latest status on this (the document there
is from 2021).

So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtool
http://github.com/DMTF/RedFishTool etc that just makes it a bit easier to poke
with common commands.

I'm far from an expert of redfish so may have this all wrong.
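
For what it's worth, such a wrapper ultimately bottoms out in plain HTTPS
requests against the standard Redfish service root (/redfish/v1/). A
throwaway C sketch with libcurl (the controller address is a placeholder and
a real tool would parse the returned JSON rather than just print it):

/* Build with: cc redfish_root.c -lcurl */
#include <curl/curl.h>

int main(void)
{
        CURL *c;
        CURLcode rc;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        c = curl_easy_init();
        if (!c)
                return 1;
        /* /redfish/v1/ is the standard Redfish service root. */
        curl_easy_setopt(c, CURLOPT_URL, "https://192.0.2.10/redfish/v1/");
        curl_easy_setopt(c, CURLOPT_SSL_VERIFYPEER, 0L);        /* lab setup only */
        rc = curl_easy_perform(c);                              /* body is written to stdout */
        curl_easy_cleanup(c);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
}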

> 
> Discover - discover available agents
> Subcommands:
>     - fm_cli discover fm - discover FM instances

If we are allowing more than one FM then I'd expect all the
other commands to be directed at it by some sort of FM-specific
ID. If there is only one, what does this command do that isn't better
done with fm get_info?


>     - fm_cli discover cxl_devices - discover CXL devices
>     - fm_cli discover logical_devices - discover logical devices

Discover switches as well.

> 
> FM - manage Fabric Manager
> Subcommands:
>     - fm_cli fm get_info - get FM status/info
>     - fm_cli fm start - start FM instance
>     - fm_cli fm restart - restart FM instance
>     - fm_cli fm stop - stop FM instance
>     - fm_cli fm get_config - get FM configuration
>     - fm_cli fm set_config - set FM configuration

I'd keep this slim for now.  No idea what FM config we might want to
set so don't bother listing commands yet.

>     - fm_cli fm get_events - get event records
Not sure what FM would have in the way of events (as opposed to
things it is talking to).

> 
> Switch - manage CXL switch
> Subcommands:
>     - fm_cli switch get_info - get CXL switch info/status

These all need an ID field of some type to identify which switch.

>     - fm_cli switch get_config - get switch configuration
>     - fm_cli switch set_config - set switch configuration
> 
> Logical Device - manage logical devices
> Subcommands:
>     - fm_cli multi_headed_device info - retrieves the number of heads, number of
>            supported LDs, and Head-to- LD mapping of a Multi-Headed device
>     - fm_cli logical_device bind - bind logical device
>     - fm_cli logical_device unbind - unbind logical device
>     - fm_cli logical_device connect - connect Multi Logical Device to CXL switch
>     - fm_cli logical_device disconnect - disconnect Multi Logical Device from CXL switch
>     - fm_cli logical_device get_allocation - Get LD Allocations (retrieves the memory
>            allocations of the MLD)
>     - fm_cli logical_device set_allocation - Set LD Allocations (sets the memory allocation
>            for each LD)
>     - fm_cli logical_device get_qos_control - Get QoS Control (retrieves the MLD’s QoS
>            control parameters)
>     - fm_cli logical_device set_qos_control - Set QoS Control (sets the MLD’s QoS control
>            parameters)
>     - fm_cli logical_device get_qos_status - Get QoS Status (retrieves the MLD’s QoS Status)
>     - fm_cli logical_device get_qos_allocated_bandwidth - Get QoS Allocated Bandwidth
>           (retrieves the MLD’s QoS allocated bandwidth on a per-LD basis)
>     - fm_cli logical_device set_qos_allocated_bandwidth - Set QoS Allocated Bandwidth
>           (sets the MLD’s QoS allocated bandwidth on a per-LD basis)
>     - fm_cli logical_device get_qos_bandwidth_limit - Get QoS Bandwidth Limit (retrieves the
>           MLD’s QoS bandwidth limit on a per-LD basis)
>     - fm_cli logical_device set_qos_bandwidth_limit - Set QoS Bandwidth Limit (sets the
>           MLD’s QoS bandwidth limit on a per-LD basis)
>     - fm_cli logical_device erase - secure erase after unbinding
> 
> PCI-to-PCI Bridge - manage PPB (PCI-to-PCI Bridge)
> Subcommands:
>     - fm_cli ppb config - Send PPB (PCI-to-PCI Bridge) CXL.io Configuration Request

That one may want a more convenient interface as likely a lot of commands would be sent
if the aim is to configure a device before binding.  Also CXL.io Memory requests want to be
here I think.

>     - fm_cli ppb bind - Bind vPPB (Virtual PCI-to-PCI Bridge inside a CXL switch that is
>            host-owned)
>     - fm_cli ppb unbind - Unbind vPPB (unbinds the physical port or LD from the virtual
>            hierarchy PPB)
> 
> Physical Port - manage physical ports
> Subcommands:
>     - fm_cli physical_port get_info - get state of physical port
>     - fm_cli physical_port control - control unbound ports and MLD ports, including issuing
>            resets and controlling sidebands
>     - fm_cli physical_port bind - bind physical port to vPPB (Virtual PCI-to-PCI Bridge)
>     - fm_cli physical_port unbind - unbind physical port from vPPB (Virtual PCI-to-PCI Bridge)
> 
> MLD (Multi-Logical Device) Port - manage Multi-Logical Device ports
> Subcommands:
>     - fm_cli mld_port tunnel - Tunnel Management Command (tunnels the provided command
>            to LD FFFFh of the MLD on the specified port)

Make it clear how nesting of commands in a tunnel would be specified.

>     - fm_cli mld_port send_config - Send LD (Logical Device) or FMLD (Fabric
>            Manager-owned Logical Device) CXL.io Configuration Request
>     - fm_cli mld_port send_memory_request - Send LD CXL.io Memory Request
> 
> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
> Subcommands:
>     - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
>          total Dynamic Capacity of the device, and supported region configurations)
>     - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
>          (retrieves the Dynamic Capacity configuration for a specified host)
>     - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
>          (sets the configuration of a DC Region)
>     - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
>          Extent List for a specified host)
>     - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
>          Dynamic Capacity to the specified region on a host)

That one is complex ;) Probably needs a whole man page to itself.

>     - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
>          Dynamic Capacity from a host)
> 
> FM daemon receives requests from configuration tool and executes commands by means of
> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
>     - Execute configuration tool commands
>     - Manage hot-add and hot-removal of devices

In what sense?  I'd expect it to notify some higher level entity
(orchestrator or similar) but not sure I see what management the
FM would do.  

>     - Manage surprise removal of devices

Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
what to do in the way of managing this.  Scream loudly?

>     - Receive and handle event notifications from the CXL switch
>     - Logging events
>     - Memory allocation and QoS Telemetry management
>     - Error/Failure handling

I'm not sure on separation of role between this component and
higher level policy / admin driven software.

For memory allocation it might take a 'give host A this much
memory with this characteristic set' command and own the
allocations across all present devices, or it might just
act as an interface layer to higher level software that does
the fine detail of figuring out which device to allocate memory
from to satisfy such a request.
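
As a sketch of that first option, the request the FM would own might be no
more than something like this (every name here is invented for discussion):

#include <stdint.h>

enum mem_characteristic {
        MEM_CHAR_DEFAULT,
        MEM_CHAR_LOW_LATENCY,
        MEM_CHAR_HIGH_BANDWIDTH,
        MEM_CHAR_PERSISTENT,
};

struct fm_mem_alloc_request {
        uint32_t host_id;               /* which host gets the capacity */
        uint64_t bytes;                 /* how much */
        enum mem_characteristic wanted; /* what kind */
        /* The FM (or higher-level software) maps this onto concrete LDs or
         * DCD extents across whatever devices are present. */
};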

Whilst I agree having a broad vision for an interface is good
there are a lot of subtle details in some of these commands
so I'd not spend too long refining the whole lot. Probably better
to look at them one at a time and then just have whoever ends
up maintaining / reviewing this thing responsible for making sure the
parameter format etc is consistent across commands.

Fun fun fun

Jonathan

> 
> Thanks,
> Slava.
> 




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-01-31 17:41 ` Jonathan Cameron
@ 2023-02-01 20:04   ` Viacheslav A.Dubeyko
  2023-02-02  9:54     ` Jonathan Cameron
  0 siblings, 1 reply; 13+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-02-01 20:04 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: lsf-pc, linux-mm, linux-cxl, Dan Williams, Adam Manzanares,
	Cong Wang, Viacheslav Dubeyko

Hi Jonathan,

> On Jan 31, 2023, at 9:41 AM, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> On Mon, 30 Jan 2023 11:11:23 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> 
>> Hello,
> 
> Hi Slava,
> 
> I'll throw some opinions at this :)
> 
>> 
>> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
>> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
>> of CXL hardware features. FM daemon receives requests from configuration tool and
>> executes commands by means of interaction with kernel-space subsystem and CXL switch
>> (that can be emulated by QEMU). So, the key questions for discussion:
> 
> Worth describing operating modes to be supported: You kind of cover this later
> but I think pulling it out make it clearer that we want one bit of software to
> do several different things.
> 
> 1) FM separate from hosts and talked to by higher level orchestration software
>   but using a Switch CCI or MHD mailbox (over PCI)
>   This one is fairly easy because any security / shooting self in foot problems
>   are an issue for higher level software. 
> 2) FM on host.  Probably mostly going be relevant for debug but may use
>   the same mailbox as is being used by the existing CXL drivers (for Multi
>   Head Device it might be the end point mailbox, for Multi Logical Device
>   behind a switch it might be the switch mailbox).
> 3) All out of band (MCTP or similar - want some shared code, but no
>   need for anything in kernel as far as I can tell).
> 

Most probably, we will have multiple FM implementations in firmware.
Yes, FM on host could be important for debug and to verify the correctness of
firmware-based implementations. But an FM daemon on the host could be important
for receiving notifications and reacting somehow to these events. Also, journalling
of events/messages could be an important responsibility of the FM daemon
on the host.

> 
>> (1) How to distribute functionality between user-space and kernel-space?
> 
> Kernel for transport if mailbox based (switch or MHD).
> Possibly help in kernel with the host to Multiheaded device FM LD tunneling
> and host to switch to Multi Logical Device - Logical Device tunneling
> but that could also be left to userspace.
> 

People love to move everything into user-space now. But I believe we could have
both kernel-space and user-space solutions. I think we need to check which way would be
the more efficient and elegant solution.

> If MCTP use the existing MCTP framework which is underlying transport independent.
> I posted a PoC for how this might work a while ago (hack on top of MCTP-I2C
> and some emulation) In the cover letter of the emulation PoC
> 

Sounds interesting. Let me check it. But I believe it may not be the first task
in this implementation. :)

> 
> I think everything else belongs in userspace. I believe there are redfish APIs
> etc that would then be used to query and drive the userspace program from an
> orchestrator or similar level software.
> 

I need to check the redfish API. It sounds reasonable to employ some existing
framework.

>> (2) Which functionality kernel-space needs to provide for implementation FM features?
>>      Which kernel-space functionality do we need to implement yet?
> 
> Very little needed if we just expose the transport via PCI mailboxes.
> There is a possible concern that FM-API commands are frequently
> destructive and currently we don't let userspace poke destructive
> commands. That may just need a specific opt in to say we know we
> can shoot ourselves in the foot.
> 

I think this is why we need the kernel. It sounds to me like we have to have user-space
and kernel-space collaboration here.

>> (3) Do we need MCTP (Management Component Transport Protocol) or some other
>>      protocol can be used for interaction between configuration tool, FM daemon, and
>>      CXL switch?
> 
> Yes MCTP is needed.
> I don't think we want the actual management code to be different
> depending on transport / protocol.  However we might layer it so that there
> is an interface program that sits between the management library / program and
> the FM-API transport.
> 
> Note I was struggling to find a suitable MCTP interface to emulate - so would
> welcome suggestions on that.  I hacked the above PoC using an aspeed i2c
> controller that supported the right magic combination of features needed
> for MCTP over I2C but it doesn't have ACPI support which rather limits
> usage (and I doubt anyone will be keen on adding ACPI support just to
> test CXL related code :)  If anyone knows of a suitable MCTP host we
> could use for this that would be great (MCTP over PCI VDM might be nice for
> example)
> 

Let us start with some command/feature implementation and we will figure it out.
But I assume we need to start with something like CXL device discovery first.

>> (4) What architecture FM implementation requires?
>> (5) Does it make sense to use Rust as implementation language?
> 
> Take your pick ;) First person to write a lot of code gets to pick the language.
> 

Yeah, I see the point. Rust can provide some benefits (its memory safety model, for example).
But it could introduce some issues with collaboration and make the implementation
slower. Everybody develops in C, and switching to Rust could be not such an easy
target.

<skipped>

>> 
>> 
>> FM configuration tool requires such commands:
> 
> A command line tool is fine, but like the 'real' FM configuration interface will be via
> a protocol (e.g. redfish).
> There is a WIP for CXL, though not sure on latest status on this (document on there is from
> 2021)
> 
> So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtoo
>  that just makes it a bit easier to poke
> with common commands.
> 
> I'm far from an expert of redfish so may have this all wrong.
> 

Sounds reasonable to me. Let me check how good it could be for this project.

>> 
>> Discover - discover available agents
>> Subcommands:
>>    - fm_cli discover fm - discover FM instances
> 
> If we are allowing more than one FM then I'd expect all the
> other commands to be directed at that by some sort of FM specific
> ID. If only one, what does this command do that isn't better
> done with fm get_info
> 

Yes, we need to identify every object somehow. And it’s an interesting point.
From my point of view, some human-friendly names could be good.
But a firmware-based FM implementation needs to follow the same rules.
And it sounds to me like the CXL specification should define how a CXL FM or
CXL device identifies itself. Anyway, we need to ask the CXL device and it should
return some ID to us. Probably, it will be some GUID or similar number.

> 
>>    - fm_cli discover cxl_devices - discover CXL devices
>>    - fm_cli discover logical_devices - discover logical devices
> 
> Discover switches as well.
> 

I assumed that a CXL switch is a subclass of CXL devices. Do you mean that
it is an independent case?

>> 
>> FM - manage Fabric Manager
>> Subcommands:
>>    - fm_cli fm get_info - get FM status/info
>>    - fm_cli fm start - start FM instance
>>    - fm_cli fm restart - restart FM instance
>>    - fm_cli fm stop - stop FM instance
>>    - fm_cli fm get_config - get FM configuration
>>    - fm_cli fm set_config - set FM configuration
> 
> I'd keep this slim for now.  No idea what FM config we might want to
> set so don't bother listing command yet.
> 

Yeah, it’s not completely clear yet. But I assume we can consider configuration
options such as:
(1) registering to receive event notifications
(2) logging of events
(3) error handling

>>    - fm_cli fm get_events - get event records
> Not sure what FM would have in the way of events (as opposed to
> things it is talking to).
> 

I think the FM can log events. If we consider an FM daemon on the host, then it
could issue messages to the end user in reaction to some events.

>> 
>> Switch - manage CXL switch
>> Subcommands:
>>    - fm_cli switch get_info - get CXL switch info/status
> 
> These all need an ID field of some type to identify which switch.
> 

Yeah, it is exactly what we need for every command. We need to identify
an object for a request.

>>    - fm_cli switch get_config - get switch configuration
>>    - fm_cli switch set_config - set switch configuration

<skipped>

>> 
>> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
>> Subcommands:
>>    - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
>>         total Dynamic Capacity of the device, and supported region configurations)
>>    - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
>>         (retrieves the Dynamic Capacity configuration for a specified host)
>>    - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
>>         (sets the configuration of a DC Region)
>>    - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
>>         Extent List for a specified host)
>>    - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
>>         Dynamic Capacity to the specified region on a host)
> 
> That one is complex ;) Probably needs a whole man page to itself.
> 

Currently, it’s only a declaration of the command set. Yeah, the implementation will be complex. :)

>>    - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
>>         Dynamic Capacity from a host)
>> 
>> FM daemon receives requests from configuration tool and executes commands by means of
>> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
>>    - Execute configuration tool commands
>>    - Manage hot-add and hot-removal of devices
> 
> In what sense?  I'd expect it to notify some higher level entity
> (orchestrator or similar) but not sure I see what management the
> FM would do.  
> 

I assume that if the FM manages some metadata, then hot-add or hot-removal could
require some metadata corrections. Also, hot-add and hot-removal can generate some
events that the FM can receive and process somehow. For example, it is possible to log
event messages into some journal.

>>    - Manage surprise removal of devices
> 
> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> what to do in the way of managing this.  Scream loudly?
> 

Maybe it could require notifying application(s). Let’s imagine that an application
uses some resources from the removed device. Maybe the FM can manage kernel-space
metadata correction and help to manage application requests to no-longer-existing
entities.

>>    - Receive and handle event notifications from the CXL switch
>>    - Logging events
>>    - Memory allocation and QoS Telemetry management
>>    - Error/Failure handling
> 
> I'm not sure on separation of role between this component and
> higher level policy / admin driven software.
> 
> For memory allocation it might take a 'give host A this much
> memory with this characteristic set' command and own the
> allocations across all present devices, or it might just
> act as an interface layer to higher level software that does
> the fine detail of figuring out which device to allocate memory
> from to satisfy such a request.
> 
> Whilst I agree having a broad vision for an interface is good
> there are a lot of subtle details in some of these commands
> so I'd not spend too long refining the whole lot. Probably better
> to look at them one at a time and then just have whoever ends
> up maintaining / reviewing this thing responsible for making sure the
> parameter format etc is consistent across commands.
> 

Yes, I agree. Let’s do it step by step. I believe we need to start by
implementing an application that processes commands and does nothing
at first. And the first command that needs to be implemented is discovery
of CXL devices, switches, and FM instances, because we need to identify a
CXL object somehow for any other command.

Thanks,
Slava.




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-01 20:04   ` [External] " Viacheslav A.Dubeyko
@ 2023-02-02  9:54     ` Jonathan Cameron
  2023-02-08 16:38       ` Adam Manzanares
  0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2023-02-02  9:54 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: lsf-pc, linux-mm, linux-cxl, Dan Williams, Adam Manzanares,
	Cong Wang, Viacheslav Dubeyko

On Wed, 1 Feb 2023 12:04:56 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:

> Hi Jonathan,
> 
> > On Jan 31, 2023, at 9:41 AM, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > 
> > On Mon, 30 Jan 2023 11:11:23 -0800
> > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >   
> >> Hello,  
> > 
> > Hi Slava,
> > 
> > I'll throw some opinions at this :)
> >   
> >> 
> >> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
> >> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
> >> of CXL hardware features. FM daemon receives requests from configuration tool and
> >> executes commands by means of interaction with kernel-space subsystem and CXL switch
> >> (that can be emulated by QEMU). So, the key questions for discussion:  
> > 
> > Worth describing operating modes to be supported: You kind of cover this later
> > but I think pulling it out make it clearer that we want one bit of software to
> > do several different things.
> > 
> > 1) FM separate from hosts and talked to by higher level orchestration software
> >   but using a Switch CCI or MHD mailbox (over PCI)
> >   This one is fairly easy because any security / shooting self in foot problems
> >   are an issue for higher level software. 
> > 2) FM on host.  Probably mostly going be relevant for debug but may use
> >   the same mailbox as is being used by the existing CXL drivers (for Multi
> >   Head Device it might be the end point mailbox, for Multi Logical Device
> >   behind a switch it might be the switch mailbox).
> > 3) All out of band (MCTP or similar - want some shared code, but no
> >   need for anything in kernel as far as I can tell).
> >   
> 
> Most probably, we will have multiple FM implementations in firmware.
> Yes, FM on host could be important for debug and to verify correctness
> firmware-based implementations. But FM daemon on host could be important
> to receive notifications and react somehow on these events. Also, journalling
> of events/messages/events could be important responsibility of FM daemon
> on host. 

I agree with an FM daemon somewhere (potentially running on the BMC-type chip
that also has the lower-level FM-API access).  I think it is somewhat
separate from the rest of this, on the basis that it may well just be talking Redfish
to the FM and there are lots of tools for that sort of handling already.

> 
> >   
> >> (1) How to distribute functionality between user-space and kernel-space?  
> > 
> > Kernel for transport if mailbox based (switch or MHD).
> > Possibly help in kernel with the host to Multiheaded device FM LD tunneling
> > and host to switch to Multi Logical Device - Logical Device tunneling
> > but that could also be left to userspace.
> >   
> 
> People loves to move everything in user-space now. But I believe we could have
> as kernel-space as user-space solutions. I think we ned to check what way could be
> more efficient and elegant solution.

Agreed - though I think we need to remember running this on the host that is
using the devices isn't likely to be a common actual usecase.  So we should
design for that to 'work' but not to be the assumed method. Hence if any
sync type activity is needed it might be a case of don't do the wrong thing
rather than hard protections.

> 
> > If MCTP use the existing MCTP framework which is underlying transport independent.
> > I posted a PoC for how this might work a while ago (hack on top of MCTP-I2C
> > and some emulation) In the cover letter of the emulation PoC
> >   
> 
> Sounds interesting. Let me check it. But I believe it could not be not the first task
> in this implementation. :)

Some level of MCTP support needs to be early enough that we don't get
any design decisions wrong.  For MCTP I think the vast majority of handling
has to be in userspace. I don't want to end up with duplication because we did
some of that down in the kernel for the mailbox solution.

> 
> > 
> > I think everything else belongs in userspace. I believe there are redfish APIs
> > etc that would then be used to query and drive the userspace program from an
> > orchestrator or similar level software.
> >   
> 
> I need to check the redfish API. It sounds reasonable to employ some existing
> framework.
> 
> >> (2) Which functionality kernel-space needs to provide for implementation FM features?
> >>      Which kernel-space functionality do we need to implement yet?  
> > 
> > Very little needed if we just expose the transport via PCI mailboxes.
> > There is a possible concern that FM-API commands are frequently
> > destructive and currently we don't let userspace poke destructive
> > commands. That may just need a specific opt in to say we know we
> > can shoot ourselves in the foot.
> >   
> 
> I think this is why we need kernel. It sounds for me that we have to have user-space
> and kernel-space collaboration here.

I think it will be lightweight and looks like the existing CXL mailbox userspace
interface (some commands are the same).

> 
> >> (3) Do we need MCTP (Management Component Transport Protocol) or some other
> >>      protocol can be used for interaction between configuration tool, FM daemon, and
> >>      CXL switch?  
> > 
> > Yes MCTP is needed.
> > I don't think we want the actual management code to be different
> > depending on transport / protocol.  However we might layer it so that there
> > is an interface program that sits between the management library / program and
> > the FM-API transport.
> > 
> > Note I was struggling to find a suitable MCTP interface to emulate - so would
> > welcome suggestions on that.  I hacked the above PoC using an aspeed i2c
> > controller that supported the right magic combination of features needed
> > for MCTP over I2C but it doesn't have ACPI support which rather limits
> > usage (and I doubt anyone will be keen on adding ACPI support just to
> > test CXL related code :)  If anyone knows of a suitable MCTP host we
> > could use for this that would be great (MCTP over PCI VDM might be nice for
> > example)
> >   
> 
> Let us start some command/feature implementation and we will figure it out.
> But, I assume we need to start from something like CXL devices discovery at first.

Sure - some of the kernel side of that was present in the switch-cci mailbox PoC.
Obviously the tooling was a test hack though ;)

> 
> >> (4) What architecture FM implementation requires?
> >> (5) Does it make sense to use Rust as implementation language?  
> > 
> > Take your pick ;) First person to write a lot of code gets to pick the language.
> >   
> 
> Yeah, I see the point. Rust can provide some benefits (memory safety model, for example).
> But it could introduce some issue with collaboration and makes implementation more
> slow. Everybody develops in C language. But switching on Rust could be not so easy
> target.
> 
> <skipped>
> 
> >> 
> >> 
> >> FM configuration tool requires such commands:  
> > 
> > A command line tool is fine, but like the 'real' FM configuration interface will be via
> > a protocol (e.g. redfish).
> > There is a WIP for CXL, though not sure on latest status on this (document on there is from
> > 2021)
> > 
> > So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtoo
> >  that just makes it a bit easier to poke
> > with common commands.
> > 
> > I'm far from an expert of redfish so may have this all wrong.
> >   
> 
> Sounds reasonable to me. Let me check how good it could be for this project.
> 
> >> 
> >> Discover - discover available agents
> >> Subcommands:
> >>    - fm_cli discover fm - discover FM instances  
> > 
> > If we are allowing more than one FM then I'd expect all the
> > other commands to be directed at that by some sort of FM specific
> > ID. If only one, what does this command do that isn't better
> > done with fm get_info
> >   
> 
> Yes, we need to identify every object somehow. And it’s interesting point.
> From point of view, some human-friendly names could be good.
> But firmware-based FM implementation needs to follow the same rules.
> And it sounds for me that CXL specification should define how CXL FM or
> CXL device identify itself. Anyway, we need to ask CXL device and it should
> return to us some ID. Probably, it will be some GUID or likewise number.
> 
> >   
> >>    - fm_cli discover cxl_devices - discover CXL devices
> >>    - fm_cli discover logical_devices - discover logical devices  
> > 
> > Discover switches as well.
> >   
> 
> I assumed that CXL switch is a subclass of CXL devices. Do you mean that
> it is independent case?

Maybe simpler broken out. What you do with a switch is often very different
from type 3 devices.

> 
> >> 
> >> FM - manage Fabric Manager
> >> Subcommands:
> >>    - fm_cli fm get_info - get FM status/info
> >>    - fm_cli fm start - start FM instance
> >>    - fm_cli fm restart - restart FM instance
> >>    - fm_cli fm stop - stop FM instance
> >>    - fm_cli fm get_config - get FM configuration
> >>    - fm_cli fm set_config - set FM configuration  
> > 
> > I'd keep this slim for now.  No idea what FM config we might want to
> > set so don't bother listing command yet.
> >   
> 
> Yeah, it’s not completely clear yet. But I assume we can consider such
> configuration options like:
> (1) register to receive event notifications
> (2) logging of events
> (3) errors handling
> 
> >>    - fm_cli fm get_events - get event records  
> > Not sure what FM would have in the way of events (as opposed to
> > things it is talking to).
> >   
> 
> I think FM can log events. If we consider FM daemon on host, then it
> could issue messages to end user as reaction to some events.
> 
> >> 
> >> Switch - manage CXL switch
> >> Subcommands:
> >>    - fm_cli switch get_info - get CXL switch info/status  
> > 
> > These all need an ID field of some type to identify which switch.
> >   
> 
> Yeah, it is exactly what we need for every command. We need to identify
> an object for a request.
> 
> >>    - fm_cli switch get_config - get switch configuration
> >>    - fm_cli switch set_config - set switch configuration  
> 
> <skipped>
> 
> >> 
> >> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
> >> Subcommands:
> >>    - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
> >>         total Dynamic Capacity of the device, and supported region configurations)
> >>    - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
> >>         (retrieves the Dynamic Capacity configuration for a specified host)
> >>    - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
> >>         (sets the configuration of a DC Region)
> >>    - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
> >>         Extent List for a specified host)
> >>    - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
> >>         Dynamic Capacity to the specified region on a host)  
> > 
> > That one is complex ;) Probably needs a whole man page to itself.
> >   
> 
> Currently, it’s only declaration of command set. Yeah, implementation will be complex. :)
> 
> >>    - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
> >>         Dynamic Capacity from a host)
> >> 
> >> FM daemon receives requests from configuration tool and executes commands by means of
> >> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
> >>    - Execute configuration tool commands
> >>    - Manage hot-add and hot-removal of devices  
> > 
> > In what sense?  I'd expect it to notify some higher level entity
> > (orchestrator or similar) but not sure I see what management the
> > FM would do.  
> >   
> 
> I assume that if FM manages some metadata, then hot-add or hot-removal could
> require some metadata corrections. Also, hot-add and hot-removal can generate some
> events that FM can receive and process somehow. For example, it is possible to log
> event messages into some journal.

Ok. Potentially stuff there - though exactly which layer ends up managing this
stuff isn't obvious to me yet.

> 
> >>    - Manage surprise removal of devices  
> > 
> > Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> > what to do in the way of managing this.  Scream loudly?
> >   
> 
> Maybe, it could require application(s) notification. Let’s imagine that application
> uses some resources from removed device. Maybe, FM can manage kernel-space
> metadata correction and helping to manage application requests to not existing
> entities.

Notifications for the host are likely to come via in-band means - so type 3 driver
handling rather than anything related to the FM.  As far as the host is concerned this is the
same as the case where there is no FM and someone ripped a device out.

There might indeed be metadata to manage, but I doubt it will have anything to
do with the kernel.

> 
> >>    - Receive and handle event notifications from the CXL switch
> >>    - Logging events
> >>    - Memory allocation and QoS Telemetry management
> >>    - Error/Failure handling  
> > 
> > I'm not sure on separation of role between this component and
> > higher level policy / admin driven software.
> > 
> > For memory allocation it might take a 'give host A this much
> > memory with this characteristic set' command and own the
> > allocations across all present devices, or it might just
> > act as an interface layer to higher level software that does
> > the fine detail of figuring out which device to allocate memory
> > from to satisfy such a request.
> > 
> > Whilst I agree having a broad vision for an interface is good
> > there are a lot of subtle details in some of these commands
> > so I'd not spend too long refining the whole lot. Probably better
> > to look at them one at a time and then just have whoever ends
> > up maintaining / reviewing this thing responsible for making sure the
> > parameter format etc is consistent across commands.
> >   
> 
> Yes, I agree. Let’s do it step by step. I believe we need to start from
> implementation the application that process commands and do nothing
> at first. And first command that needs to be implemented is a discovery
> of CXL devices, switches, and FM instances because we need to identify
> CXL object somehow for any other command.

Agreed, discovery of devices and capabilities is definitely where to start,
plus I think presenting that as a Redfish model.

Jonathan

> 
> Thanks,
> Slava.
> 




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-02  9:54     ` Jonathan Cameron
@ 2023-02-08 16:38       ` Adam Manzanares
  2023-02-08 18:03         ` Viacheslav A.Dubeyko
  0 siblings, 1 reply; 13+ messages in thread
From: Adam Manzanares @ 2023-02-08 16:38 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Viacheslav A.Dubeyko, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko

On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> On Wed, 1 Feb 2023 12:04:56 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> 
> > Hi Jonathan,
> > 
> > > On Jan 31, 2023, at 9:41 AM, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > 
> > > On Mon, 30 Jan 2023 11:11:23 -0800
> > > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> > >   
> > >> Hello,  
> > > 
> > > Hi Slava,
> > > 
> > > I'll throw some opinions at this :)
> > >   
> > >> 
> > >> I would like to suggest Fabric Manager (FM) architecture discussion. As far as I can see,
> > >> FM architecture requires: (1) FM configuration tool, (2) FM daemon, (3) QEMU emulation
> > >> of CXL hardware features. FM daemon receives requests from configuration tool and
> > >> executes commands by means of interaction with kernel-space subsystem and CXL switch
> > >> (that can be emulated by QEMU). So, the key questions for discussion:  
> > > 
> > > Worth describing operating modes to be supported: You kind of cover this later
> > > but I think pulling it out make it clearer that we want one bit of software to
> > > do several different things.
> > > 
> > > 1) FM separate from hosts and talked to by higher level orchestration software
> > >   but using a Switch CCI or MHD mailbox (over PCI)
> > >   This one is fairly easy because any security / shooting self in foot problems
> > >   are an issue for higher level software. 
> > > 2) FM on host.  Probably mostly going be relevant for debug but may use
> > >   the same mailbox as is being used by the existing CXL drivers (for Multi
> > >   Head Device it might be the end point mailbox, for Multi Logical Device
> > >   behind a switch it might be the switch mailbox).
> > > 3) All out of band (MCTP or similar - want some shared code, but no
> > >   need for anything in kernel as far as I can tell).
> > >   
> > 
> > Most probably, we will have multiple FM implementations in firmware.
> > Yes, FM on host could be important for debug and to verify correctness
> > firmware-based implementations. But FM daemon on host could be important
> > to receive notifications and react somehow on these events. Also, journalling
> > of events/messages/events could be important responsibility of FM daemon
> > on host. 
> 
> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> that also has the lower level FM-API access).  I think it is somewhat
> separate from the rest of this on basis it may well just be talking redfish
> to the FM and there are lots of tools for that sort of handling already.
> 

I would be interested in participating in a BOF about this topic. I wonder what
happens when we have multiple switches with multiple FMs, each on a separate BMC.
In this case, does it make more sense to have the owner of the global FM state
be a user space application? Is this the job of the orchestrator?

The BMC-based FM seems to have scalability issues, but will we hit them in
practice any time soon?

> > 
> > >   
> > >> (1) How to distribute functionality between user-space and kernel-space?  
> > > 
> > > Kernel for transport if mailbox based (switch or MHD).
> > > Possibly help in kernel with the host to Multiheaded device FM LD tunneling
> > > and host to switch to Multi Logical Device - Logical Device tunneling
> > > but that could also be left to userspace.
> > >   
> > 
> > People loves to move everything in user-space now. But I believe we could have
> > as kernel-space as user-space solutions. I think we ned to check what way could be
> > more efficient and elegant solution.
> 
> Agreed - though I think we need to remember running this on the host that is
> using the devices isn't likely to be a common actual usecase.  So we should
> design for that to 'work' but not to be the assumed method. Hence if any
> sync type activity is needed it might be a case of don't do the wrong thing
> rather than hard protections.
> 
> > 
> > > If MCTP use the existing MCTP framework which is underlying transport independent.
> > > I posted a PoC for how this might work a while ago (hack on top of MCTP-I2C
> > > and some emulation) In the cover letter of the emulation PoC
> > >   
> > 
> > Sounds interesting. Let me check it. But I believe it could not be not the first task
> > in this implementation. :)
> 
> Some level of MCTP support needs to be early enough that we don't get
> any design decisions wrong.  For MCTP I think the vast majority of handling
> has to be in userspace. I don't want to end up with duplication because we did
> some of that down in the kernel for the mailbox solution.
> 
> > 
> > > 
> > > I think everything else belongs in userspace. I believe there are redfish APIs
> > > etc that would then be used to query and drive the userspace program from an
> > > orchestrator or similar level software.
> > >   
> > 
> > I need to check the redfish API. It sounds reasonable to employ some existing
> > framework.
> > 
> > >> (2) Which functionality kernel-space needs to provide for implementation FM features?
> > >>      Which kernel-space functionality do we need to implement yet?  
> > > 
> > > Very little needed if we just expose the transport via PCI mailboxes.
> > > There is a possible concern that FM-API commands are frequently
> > > destructive and currently we don't let userspace poke destructive
> > > commands. That may just need a specific opt in to say we know we
> > > can shoot ourselves in the foot.
> > >   
> > 
> > I think this is why we need kernel. It sounds for me that we have to have user-space
> > and kernel-space collaboration here.
> 
> I think it will be lightweight and looks like the existing CXL mailbox userspace
> interface (some commands are the same).
> 
> > 
> > >> (3) Do we need MCTP (Management Component Transport Protocol) or some other
> > >>      protocol can be used for interaction between configuration tool, FM daemon, and
> > >>      CXL switch?  
> > > 
> > > Yes MCTP is needed.
> > > I don't think we want the actual management code to be different
> > > depending on transport / protocol.  However we might layer it so that there
> > > is an interface program that sits between the management library / program and
> > > the FM-API transport.
> > > 
> > > Note I was struggling to find a suitable MCTP interface to emulate - so would
> > > welcome suggestions on that.  I hacked the above PoC using an aspeed i2c
> > > controller that supported the right magic combination of features needed
> > > for MCTP over I2C but it doesn't have ACPI support which rather limits
> > > usage (and I doubt anyone will be keen on adding ACPI support just to
> > > test CXL related code :)  If anyone knows of a suitable MCTP host we
> > > could use for this that would be great (MCTP over PCI VDM might be nice for
> > > example)
> > >   
> > 
> > Let us start some command/feature implementation and we will figure it out.
> > But, I assume we need to start from something like CXL devices discovery at first.
> 
> Sure - some of the kernel side of that was present in the switch-cci mailbox PoC
> Obviously tooling was a test hack though ;)
> 
> > 
> > >> (4) What architecture FM implementation requires?
> > >> (5) Does it make sense to use Rust as implementation language?  
> > > 
> > > Take your pick ;) First person to write a lot of code gets to pick the language.
> > >   
> > 
> > Yeah, I see the point. Rust can provide some benefits (memory safety model, for example).
> > But it could introduce some issue with collaboration and makes implementation more
> > slow. Everybody develops in C language. But switching on Rust could be not so easy
> > target.
> > 
> > <skipped>
> > 
> > >> 
> > >> 
> > >> FM configuration tool requires such commands:  
> > > 
> > > A command line tool is fine, but like the 'real' FM configuration interface will be via
> > > a protocol (e.g. redfish).
> > > There is a WIP for CXL, though not sure on latest status on this (document on there is from
> > > 2021)
> > > 
> > > So ultimately I'd expect fm_cli to be a wrapper around libredfish / redfishtoo
> > >  that just makes it a bit easier to poke
> > > with common commands.
> > > 
> > > I'm far from an expert of redfish so may have this all wrong.
> > >   
> > 
> > Sounds reasonable to me. Let me check how good it could be for this project.
> > 
> > >> 
> > >> Discover - discover available agents
> > >> Subcommands:
> > >>    - fm_cli discover fm - discover FM instances  
> > > 
> > > If we are allowing more than one FM then I'd expect all the
> > > other commands to be directed at that by some sort of FM specific
> > > ID. If only one, what does this command do that isn't better
> > > done with fm get_info
> > >   
> > 
> > Yes, we need to identify every object somehow. And it’s interesting point.
> > From point of view, some human-friendly names could be good.
> > But firmware-based FM implementation needs to follow the same rules.
> > And it sounds for me that CXL specification should define how CXL FM or
> > CXL device identify itself. Anyway, we need to ask CXL device and it should
> > return to us some ID. Probably, it will be some GUID or likewise number.
> > 
> > >   
> > >>    - fm_cli discover cxl_devices - discover CXL devices
> > >>    - fm_cli discover logical_devices - discover logical devices  
> > > 
> > > Discover switches as well.
> > >   
> > 
> > I assumed that CXL switch is a subclass of CXL devices. Do you mean that
> > it is independent case?
> 
> Maybe simpler broken out. What you do with a switch is often very different
> form type 3 devices.
> 
> > 
> > >> 
> > >> FM - manage Fabric Manager
> > >> Subcommands:
> > >>    - fm_cli fm get_info - get FM status/info
> > >>    - fm_cli fm start - start FM instance
> > >>    - fm_cli fm restart - restart FM instance
> > >>    - fm_cli fm stop - stop FM instance
> > >>    - fm_cli fm get_config - get FM configuration
> > >>    - fm_cli fm set_config - set FM configuration  
> > > 
> > > I'd keep this slim for now.  No idea what FM config we might want to
> > > set so don't bother listing command yet.
> > >   
> > 
> > Yeah, it’s not completely clear yet. But I assume we can consider such
> > configuration options like:
> > (1) register to receive event notifications
> > (2) logging of events
> > (3) errors handling
> > 
> > >>    - fm_cli fm get_events - get event records  
> > > Not sure what FM would have in the way of events (as opposed to
> > > things it is talking to).
> > >   
> > 
> > I think FM can log events. If we consider FM daemon on host, then it
> > could issue messages to end user as reaction to some events.
> > 
> > >> 
> > >> Switch - manage CXL switch
> > >> Subcommands:
> > >>    - fm_cli switch get_info - get CXL switch info/status  
> > > 
> > > These all need an ID field of some type to identify which switch.
> > >   
> > 
> > Yeah, it is exactly what we need for every command. We need to identify
> > an object for a request.
> > 
> > >>    - fm_cli switch get_config - get switch configuraiton
> > >>    - fm_cli switch set_config - set switch configuration  
> > 
> > <skipped>
> > 
> > >> 
> > >> DCD (Dynamic Capacity Device) - manage Dynamic Capacity Device
> > >> Subcommands:
> > >>    - fm_cli dcd get_info - Get DCD Info (retrieves the number of supported hosts,
> > >>         total Dynamic Capacity of the device, and supported region configurations)
> > >>    - fm_cli dcd get_capacity_config - Get Host Dynamic Capacity Region Configuration
> > >>         (retrieves the Dynamic Capacity configuration for a specified host)
> > >>    - fm_cli dcd set_capacity_config - Set Dynamic Capacity Region Configuration
> > >>         (sets the configuration of a DC Region)
> > >>    - fm_cli dcd get_extent_list - Get DCD Extent Lists (retrieves the Dynamic Capacity
> > >>         Extent List for a specified host)
> > >>    - fm_cli dcd add_capacity - Initiate Dynamic Capacity Add (initiates the addition of
> > >>         Dynamic Capacity to the specified region on a host)  
> > > 
> > > That one is complex ;) Probably needs a whole man page to itself.
> > >   
> > 
> > Currently, it’s only declaration of command set. Yeah, implementation will be complex. :)
> > 
> > >>    - fm_cli dcd release_capacity - Initiate Dynamic Capacity Release (initiates the release of
> > >>         Dynamic Capacity from a host)
> > >> 
> > >> FM daemon receives requests from configuration tool and executes commands by means of
> > >> interaction with kernel-space subsystems. The responsibility of FM daemon could be:
> > >>    - Execute configuration tool commands
> > >>    - Manage hot-add and hot-removal of devices  
> > > 
> > > In what sense?  I'd expect it to notify some higher level entity
> > > (orchestrator or similar) but not sure I see what management the
> > > FM would do.  
> > >   
> > 
> > I assume that if FM manages some metadata, then hot-add or hot-removal could
> > require some metadata corrections. Also, hot-add and hot-removal can generate some
> > events that FM can receive and process somehow. For example, it is possible to log
> > event messages into some journal.
> 
> Ok. Potentially stuff there - though exactly which layer ends up managing this
> stuff isn't obvious to me yet.
> 
> > 
> > >>    - Manage surprise removal of devices  
> > > 
> > > Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> > > what to do in the way of managing this.  Scream loudly?
> > >   
> > 
> > Maybe, it could require application(s) notification. Let’s imagine that application
> > uses some resources from removed device. Maybe, FM can manage kernel-space
> > metadata correction and helping to manage application requests to not existing
> > entities.
> 
> Notifications for the host are likely to come via inband means - so type3 driver
> handling rather than related to FM.  As far as the host is concerned this is the
> same as case where there is no FM and someone ripped a device out.
> 
> There might indeed be meta data to manage, but doubt it will have anything to
> do with kernel.
> 

I've also had similar thoughts: I think the OS responds to notifications that
are generated in-band after changes to the state of the FM are made through
OOB means.

I envision the host sending Redfish requests to a switch BMC that has an FM
implementation. Once the changes are implemented by the FM, they show up as
changes to the PCIe hierarchy on the host, which is capable of responding to
such changes.
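
As a rough illustration of that flow, here is a minimal Python sketch, assuming
a hypothetical switch BMC at https://switch-bmc.example and made-up Redfish
resource paths (the DMTF CXL fabric schema work is still WIP, so none of the
URIs or payload fields below come from a published model):

# Hedged sketch: ask a BMC-hosted FM to bind a logical device to a
# host-facing vPPB via a hypothetical Redfish OEM action. The URI and
# payload fields are illustrative only, not from any published schema.
import requests

BMC = "https://switch-bmc.example"
AUTH = ("admin", "password")      # placeholder credentials

def bind_logical_device(switch_id: str, vppb: int, ld: int) -> None:
    url = (f"{BMC}/redfish/v1/Fabrics/CXL/Switches/{switch_id}"
           "/Actions/Oem.BindvPPB")
    payload = {"vPPBId": vppb, "LogicalDeviceId": ld}
    resp = requests.post(url, json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()       # the OOB confirmation of the request

if __name__ == "__main__":
    bind_logical_device("Switch0", vppb=2, ld=0)

The point of the sketch is only that the request/response loop stays entirely
OOB; the host then learns about the result through the normal in-band hotplug
path rather than through this call.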

> > 
> > >>    - Receive and handle even notifications from the CXL switch
> > >>    - Logging events
> > >>    - Memory allocation and QoS Telemetry management
> > >>    - Error/Failure handling  
> > > 
> > > I'm not sure on separation of role between this component and
> > > higher level policy / admin driven software.
> > > 
> > > For memory allocation it might take a 'give host A this much
> > > memory with this characteristic set' command and own the
> > > allocations across all present devices, or it might just
> > > act as an interface layer to higher level software that does
> > > the fine detail of figuring out which device to allocate memory
> > > from to satisfy such a request.
> > > 
> > > Whilst I agree having a broad vision for an interface is good
> > > there are a lot of subtle details in some of these commands
> > > so I'd not spend too long refining the whole lot. Probably better
> > > to look at them one at a time and then just have whoever ends
> > > up maintaining / reviewing this thing responsible for making sure the
> > > parameter format etc is consistent across commands.
> > >   
> > 
> > Yes, I agree. Let’s do it step by step. I believe we need to start from
> > implementation the application that process commands and do nothing
> > at first. And first command that needs to be implemented is a discovery
> > of CXL devices, switches, and FM instances because we need to identify
> > CXL object somehow for any other command.
> 
> Agreed discover of devices and capabilities is definitely where to start
> + I think presenting that as a redfish model.
> 
> Jonathan
> 
> > 
> > Thanks,
> > Slava.
> > 
> 


* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-08 16:38       ` Adam Manzanares
@ 2023-02-08 18:03         ` Viacheslav A.Dubeyko
  2023-02-09 11:05           ` Jonathan Cameron
  2023-02-09 22:10           ` Adam Manzanares
  0 siblings, 2 replies; 13+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-02-08 18:03 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Jonathan Cameron, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko



> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
> 
> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
>> On Wed, 1 Feb 2023 12:04:56 -0800
>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>> 
>>>> 

<skipped>

>>> 
>>> Most probably, we will have multiple FM implementations in firmware.
>>> Yes, FM on host could be important for debug and to verify correctness
>>> firmware-based implementations. But FM daemon on host could be important
>>> to receive notifications and react somehow on these events. Also, journalling
>>> of events/messages/events could be important responsibility of FM daemon
>>> on host. 
>> 
>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>> that also has the lower level FM-API access).  I think it is somewhat
>> separate from the rest of this on basis it may well just be talking redfish
>> to the FM and there are lots of tools for that sort of handling already.
>> 
> 
> I would be interested in particpating in a BOF about this topic. I wonder what
> happens when we have multiple switches with multiple FMs each on a separate BMC.
> In this case, does it make more sense to have an owner of the global FM state 
> be a user space application. Is this the job of the orchestrator?
> 
> The BMC based FM seems to have scalability issues, but will we hit them in
> practice any time soon.

I had a discussion recently, and it raised some interesting points:
(1) If we have multiple CXL switches (especially in a complex hierarchy), then managing
them is a very compute-intensive activity. So, potentially, a firmware-based FM might not
be capable of digesting and executing all of its responsibilities without performance
degradation.
(2) However, if we put the FM on the host side, then there are security concerns, because
the FM sees everything, including the details of multiple hosts and subsystems.
(3) Technically speaking, a user-space FM daemon could potentially run on the host side
as well as on the CXL switch side. I mean that if we implement a user-space FM daemon,
then it could also be used to execute the FM functionality on the CXL switch
side (maybe????). :)

<skipped>

>>>>>   - Manage surprise removal of devices  
>>>> 
>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>> what to do in the way of managing this.  Scream loudly?
>>>> 
>>> 
>>> Maybe, it could require application(s) notification. Let’s imagine that application
>>> uses some resources from removed device. Maybe, FM can manage kernel-space
>>> metadata correction and helping to manage application requests to not existing
>>> entities.
>> 
>> Notifications for the host are likely to come via inband means - so type3 driver
>> handling rather than related to FM.  As far as the host is concerned this is the
>> same as case where there is no FM and someone ripped a device out.
>> 
>> There might indeed be meta data to manage, but doubt it will have anything to
>> do with kernel.
>> 
> 
> I've also had similar thoughts, I think the OS responds to notifications that
> are generated in-band after changes to the state of the FM are made through 
> OOB means.
> 
> I envision the host sends REDFISH requests to a switch BMC that has an FM
> implementation. Once the changes are implemented by the FM it would show up
> as changes to the PCIe hierarchy on a host, which is capable of responding to
> such changes.
> 

I don't think I completely follow your point. :) First of all, I assume that if the host
sends a Redfish request, then it expects a confirmation that the request was executed.
That means the host needs to receive some packet informing it that the request either
succeeded or failed. It also means that some subsystem or application requested this
change, and only after receiving the confirmation can the requested capabilities be used.
And if the FM is on the CXL switch side, then how will the FM surface the changes? It
sounds to me like some FM subsystem should be on the host side to receive the
confirmation/notification and to apply the real changes to the PCIe hierarchy. Am I
missing something here?

Thanks,
Slava.




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-08 18:03         ` Viacheslav A.Dubeyko
@ 2023-02-09 11:05           ` Jonathan Cameron
  2023-02-09 22:04             ` Viacheslav A.Dubeyko
  2023-02-09 22:10           ` Adam Manzanares
  1 sibling, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2023-02-09 11:05 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: Adam Manzanares, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko

On Wed, 8 Feb 2023 10:03:57 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:

> > On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
> > 
> > On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:  
> >> On Wed, 1 Feb 2023 12:04:56 -0800
> >> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >>   
> >>>>   
> 
> <skipped>
> 
> >>> 
> >>> Most probably, we will have multiple FM implementations in firmware.
> >>> Yes, FM on host could be important for debug and to verify correctness
> >>> firmware-based implementations. But FM daemon on host could be important
> >>> to receive notifications and react somehow on these events. Also, journalling
> >>> of events/messages/events could be important responsibility of FM daemon
> >>> on host.   
> >> 
> >> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >> that also has the lower level FM-API access).  I think it is somewhat
> >> separate from the rest of this on basis it may well just be talking redfish
> >> to the FM and there are lots of tools for that sort of handling already.
> >>   
> > 
> > I would be interested in particpating in a BOF about this topic. I wonder what
> > happens when we have multiple switches with multiple FMs each on a separate BMC.
> > In this case, does it make more sense to have an owner of the global FM state 
> > be a user space application. Is this the job of the orchestrator?

This partly comes down to terminology. Ultimately there is an FM that is
responsible for the whole fabric (which could be distributed software) and that
in turn will talk to the various BMCs, which then talk to the switches.

Depending on the setup it may not be necessary for any entity to see the
whole fabric.

Interesting point in general though.  I think it boils down to getting the
layering in any software correct, and that is easier done from the outset.

I don't know whether the redfish stuff is flexible enough to cover this, but
if it is, I'd envision the actual FM talking redfish to a bunch of sub-FMs
and in turn presenting redfish to the orchestrator.

Any of these components might run on separate machines, or in firmware on
some device, or indeed all run on one server that is acting as the FM and
a node in the orchestrator layer.
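
To make that layering concrete, a hedged Python sketch of the aggregation idea,
with entirely hypothetical sub-FM endpoints and JSON shapes (this is not any
existing Redfish model, just an illustration of the fan-out):

# Hedged sketch: a top-level FM fans an inventory query out to per-switch
# sub-FMs (e.g. BMCs) and merges the answers before presenting a single
# view upwards to the orchestrator. URIs and fields are made up.
import requests

SUB_FMS = [
    "https://bmc-switch0.example",
    "https://bmc-switch1.example",
]

def list_fabric_endpoints() -> list:
    endpoints = []
    for base in SUB_FMS:
        resp = requests.get(f"{base}/redfish/v1/Fabrics/CXL/Endpoints",
                            timeout=10)
        resp.raise_for_status()
        endpoints.extend(resp.json().get("Members", []))
    return endpoints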

> > 
> > The BMC based FM seems to have scalability issues, but will we hit them in
> > practice any time soon.  

Who knows ;)  If anyone builds the large-scale fabric stuff in CXL 3.0 then
we definitely will in the medium term.

> 
> I had discussion recently and it looks like there are interesting points:
> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> very compute-intensive activity. So, potentially, FM on firmware side could be not
> capable to digest and executes all responsibilities without potential performance
> degradation.

There is firmware and there is firmware ;)  It's not uncommon for BMCs to be
significant devices in their own right and to run Linux or other heavyweight OSes.

> (2) However, if we have FM on host side, then there is security concerns because
> FM sees everything and all details of multiple hosts and subsystems.

Agreed. Other than for testing I wouldn't expect the FM to run on a 'host', but in
at least some implementations it will be running on a capable Linux machine.
In large fabrics that machine may be very capable indeed (basically a server
dedicated to this role).

> (3) Technically speaking, there is one potential capability that user-space FM daemon
> can run as on host side as on CXL switch side. I mean here that if we implement
> user-space FM daemon, then it could be used to execute FM functionality on CXL
> switch side (maybe????). :)

Sure, anything could run anywhere.  We should draw up some 'reference' architectures
though, to guide discussion down the line.  Mind you, I think there are a lot of
steps along the way, and the starting point should be a simple PoC where all the FM
stuff is in Linux userspace (other than comms).  That's easy enough to do.
If I get a quiet week or so I'll hammer out what we need on the emulation side to
start playing with this.
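
For the userspace PoC shape, something along these lines would probably do as a
starting skeleton - a command dispatcher in front of a pluggable transport
(switch CCI mailbox, MCTP socket, or QEMU emulation). The names, opcode values
and transport classes below are placeholders, not an existing interface:

# Hedged skeleton for a userspace FM PoC: fm_cli-style command dispatch
# in front of a pluggable FM-API transport. Everything here is a stub.
import argparse

class Transport:
    """Abstract FM-API transport (mailbox, MCTP, emulated)."""
    def send(self, opcode: int, payload: bytes) -> bytes:
        raise NotImplementedError

class EmulatedTransport(Transport):
    def send(self, opcode: int, payload: bytes) -> bytes:
        # Would talk to a QEMU-emulated switch CCI here; stubbed out.
        return b""

COMMANDS = {
    "discover": 0x5100,   # illustrative opcode values only
    "get_info": 0x5101,
}

def main() -> None:
    parser = argparse.ArgumentParser(prog="fm_cli")
    parser.add_argument("command", choices=sorted(COMMANDS))
    args = parser.parse_args()
    transport = EmulatedTransport()
    print(transport.send(COMMANDS[args.command], b""))

if __name__ == "__main__":
    main()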

Jonathan



> 
> <skipped>
> 
> >>>>>   - Manage surprise removal of devices    
> >>>> 
> >>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>> what to do in the way of managing this.  Scream loudly?
> >>>>   
> >>> 
> >>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>> metadata correction and helping to manage application requests to not existing
> >>> entities.  
> >> 
> >> Notifications for the host are likely to come via inband means - so type3 driver
> >> handling rather than related to FM.  As far as the host is concerned this is the
> >> same as case where there is no FM and someone ripped a device out.
> >> 
> >> There might indeed be meta data to manage, but doubt it will have anything to
> >> do with kernel.
> >>   
> > 
> > I've also had similar thoughts, I think the OS responds to notifications that
> > are generated in-band after changes to the state of the FM are made through 
> > OOB means.
> > 
> > I envision the host sends REDFISH requests to a switch BMC that has an FM
> > implementation. Once the changes are implemented by the FM it would show up
> > as changes to the PCIe hierarchy on a host, which is capable of responding to
> > such changes.
> >   
> 
> I think I am not completely follow your point. :) First of all, I assume that if host
> sends REDFISH request, then it will be expected the confirmation of request execution.
> It means for me that host needs to receive some packet that informs that request
> executed successfully or failed. It means that some subsystem or application requested
> this change and only after receiving the confirmation requested capabilities can be used.
> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> that some FM subsystem should be on the host side to receive confirmation/notification
> and to execute the real changes in PCIe hierarchy. Am missing something here?

Another terminology issue, I think.  The FM, from the CXL side of things, is an abstract
entity (potentially highly layered / distributed) that acts on instructions from an
orchestrator (also potentially highly distributed; in one implementation the hosts
can be the orchestrator) and configures the fabric.
The downstream APIs to the switches and EPs are all FM-API (CXL spec);
upstream is probably all Redfish.  What happens in between is impdef (though
obviously mapping to Redfish or FM-API as applicable may make it more
reusable and flexible).

I think some diagrams of what goes where will help.
I think we need the following (note I've kept the controlling hosts as normal hosts
throughout, since that includes the case where they never use the fabric - so BMC-type
cases are covered as a subset without needing to double the number of diagrams):

1) Diagram of a single host with the FM as one 'thing' on that host - direct interfaces
   to a single switch - interface options include switch CCI MB, MCTP over PCI VDM,
   MCTP over, say, I2C.

2) Diagram of the same as above, but with a multi-headed device all connected to one host.

3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
   one of which is responsible for fabric management.  FM (and orchestrator) in that
   manager host - agents on other hosts able to send requests for services to that host.

4) Diagram of 3, but now with multiple switches, each with a separate controlling host.
   Some other hosts that don't have any fabric control.
   Distributed FM across the controlling hosts.

5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
   orchestrator, which then talks to the FM.

6) 4, but push some management entities down into switches (from an architecture point of
   view this is no different from the layered case with a separate BMC per switch - there
   is still either a distributed FM or a layered FM, which the orchestrator talks to.)

We can mess with the exact distribution of who does what across the various layers.

I can sketch this lot up (and that will probably make some gaps in these cases apparent)
but it will take a little while, hence the text descriptions in the meantime.

I come back to my personal view though - which is: don't worry too much at this early
stage, beyond making sure we have some layering in the code so that we can spread
it across a distributed or layered architecture later!

Jonathan



> 
> Thanks,
> Slava.
> 
> 




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-09 11:05           ` Jonathan Cameron
@ 2023-02-09 22:04             ` Viacheslav A.Dubeyko
  2023-02-10 12:32               ` Jonathan Cameron
  0 siblings, 1 reply; 13+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-02-09 22:04 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Adam Manzanares, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko



> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
> On Wed, 8 Feb 2023 10:03:57 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> 
>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
>>> 
>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:  
>>>> On Wed, 1 Feb 2023 12:04:56 -0800
>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>>>> 
>>>>>> 
>> 
>> <skipped>
>> 
>>>>> 
>>>>> Most probably, we will have multiple FM implementations in firmware.
>>>>> Yes, FM on host could be important for debug and to verify correctness
>>>>> firmware-based implementations. But FM daemon on host could be important
>>>>> to receive notifications and react somehow on these events. Also, journalling
>>>>> of events/messages/events could be important responsibility of FM daemon
>>>>> on host.   
>>>> 
>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>>>> that also has the lower level FM-API access).  I think it is somewhat
>>>> separate from the rest of this on basis it may well just be talking redfish
>>>> to the FM and there are lots of tools for that sort of handling already.
>>>> 
>>> 
>>> I would be interested in particpating in a BOF about this topic. I wonder what
>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
>>> In this case, does it make more sense to have an owner of the global FM state 
>>> be a user space application. Is this the job of the orchestrator?
> 
> This partly comes down to terminology. Ultimately there is an FM that is
> responsible for the whole fabric (could be distributed software) and that
> in turn will talk to a the various BMCs that then talk to the switches.
> 
> Depending on the setup it may not be necessary for any entity to see the
> whole fabric.
> 
> Interesting point in general though.  I think it boils down to getting
> layering in any software correct and that is easier done from outset.
> 
> I don't know whether the redfish stuff is flexible enough to cover this, but
> if it is, I'd envision, the actual FM talking redfish to a bunch of sub-FMs
> and in turn presenting redfish to the orchestrator.
> 
> Any of these components might run on separate machines, or in firmware on
> some device, or indeed all run on one server that is acting as the FM and
> a node in the orchestrator layer.
> 
>>> 
>>> The BMC based FM seems to have scalability issues, but will we hit them in
>>> practice any time soon.  
> 
> Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
> we definitely will in the medium term.
> 
>> 
>> I had discussion recently and it looks like there are interesting points:
>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
>> very compute-intensive activity. So, potentially, FM on firmware side could be not
>> capable to digest and executes all responsibilities without potential performance
>> degradation.
> 
> There is firmware and their is firmware ;)  It's not uncommon for BMCs to be
> significant devices in their own right and run Linux or other heavy weight OSes.
> 
>> (2) However, if we have FM on host side, then there is security concerns because
>> FM sees everything and all details of multiple hosts and subsystems.
> 
> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> at lest some implementations it will be running on a capable Linux machine.
> In large fabrics that may be very capable indeed (basically a server dedicated to
> this role).
> 
>> (3) Technically speaking, there is one potential capability that user-space FM daemon
>> can run as on host side as on CXL switch side. I mean here that if we implement
>> user-space FM daemon, then it could be used to execute FM functionality on CXL
>> switch side (maybe????). :)
> 
> Sure, anything could run anywhere.  We should draw up some 'reference' architectures
> though to guide discussion down the line.  Mind you I think there are a lot of
> steps along the way and starting point should be a simple PoC where all the FM
> stuff is in linux userspace (other than comms).  That's easy enough to do.
> If I get a quiet week or so I'll hammer out what we need on emulation side to
> start playing with this.
> 
> Jonathan
> 
> 
> 
>> 
>> <skipped>
>> 
>>>>>>>  - Manage surprise removal of devices    
>>>>>> 
>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>>>> what to do in the way of managing this.  Scream loudly?
>>>>>> 
>>>>> 
>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
>>>>> metadata correction and helping to manage application requests to not existing
>>>>> entities.  
>>>> 
>>>> Notifications for the host are likely to come via inband means - so type3 driver
>>>> handling rather than related to FM.  As far as the host is concerned this is the
>>>> same as case where there is no FM and someone ripped a device out.
>>>> 
>>>> There might indeed be meta data to manage, but doubt it will have anything to
>>>> do with kernel.
>>>> 
>>> 
>>> I've also had similar thoughts, I think the OS responds to notifications that
>>> are generated in-band after changes to the state of the FM are made through 
>>> OOB means.
>>> 
>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
>>> implementation. Once the changes are implemented by the FM it would show up
>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
>>> such changes.
>>> 
>> 
>> I think I am not completely follow your point. :) First of all, I assume that if host
>> sends REDFISH request, then it will be expected the confirmation of request execution.
>> It means for me that host needs to receive some packet that informs that request
>> executed successfully or failed. It means that some subsystem or application requested
>> this change and only after receiving the confirmation requested capabilities can be used.
>> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
>> that some FM subsystem should be on the host side to receive confirmation/notification
>> and to execute the real changes in PCIe hierarchy. Am missing something here?
> 
> Another terminology issue I think.  FM from CXL side of things is an abstract thing
> (potentially highly layered / distributed) that acts on instructions from an
> orchestrator (also potentially highly distributed, one implementation is hosts
> can be the orchestrator) and configures the fabric.
> The downstream APIs to the switches and EPs are all in FM-API (CXL spec)
> Upstream probably all Redfish.  What happens in between is impdef (though
> obviously mapping to Redfish or FM-API as applicable may make it more
> reuseable and flexible).
> 
> I think some diagrams of what is where will help.
> I think we need (note I've always kept the controller hosts as normal hosts as well
> as that includes the case where it never uses the Fabric - so BMC type cases as
> a subset without needing to double the number of diagrams).
> 
> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
>   to a single switch - interfaces options include switch CCI MB, mctp of PCI VDM,
>   mctp over say i2c.
> 
> 2) Diagram of same as above, with a multiple head device all connected to one host.
> 
> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
>   one of which is responsible for  fabric management.   FM in that manager host
>   and orchestrator) - agents on other hosts able to send requests for services to that host.
> 
> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
>   Some other hosts that don't have any fabric control.
>   Distributed FM across the controlling hosts.
> 
> 5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
>   orchestrator, that then talks to the FM.
> 
> 6) 4, but push some management entities down into switches (from architecture point of
>   view this is no different from layered case with a separate BMC per switch - there
>   is still either a distribute FM or a layered FM, which the orchestrator talks to.)
> 
> Can mess with exactly distribution of who does what across the various layers.
> 
> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> but will take a little while, hence text descriptions in the meantime.
> 
> I come back to my personal view though - which is don't worry too much at this early
> stage, beyond making sure we have some layering in code so that we can distribute
> it across a distributed or layered architecture later!   
> 

I had a somewhat simpler picture in my mind. :) We definitely need diagrams
to clarify the vision. But which collaboration tool could we use to work publicly on
the diagrams? Any suggestions?

Thanks,
Slava.




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-08 18:03         ` Viacheslav A.Dubeyko
  2023-02-09 11:05           ` Jonathan Cameron
@ 2023-02-09 22:10           ` Adam Manzanares
  2023-02-09 22:22             ` Viacheslav A.Dubeyko
  1 sibling, 1 reply; 13+ messages in thread
From: Adam Manzanares @ 2023-02-09 22:10 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: Jonathan Cameron, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko

On Wed, Feb 08, 2023 at 10:03:57AM -0800, Viacheslav A.Dubeyko wrote:
> 
> 
> > On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
> > 
> > On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> >> On Wed, 1 Feb 2023 12:04:56 -0800
> >> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >> 
> >>>> 
> 
> <skipped>
> 
> >>> 
> >>> Most probably, we will have multiple FM implementations in firmware.
> >>> Yes, FM on host could be important for debug and to verify correctness
> >>> firmware-based implementations. But FM daemon on host could be important
> >>> to receive notifications and react somehow on these events. Also, journalling
> >>> of events/messages/events could be important responsibility of FM daemon
> >>> on host. 
> >> 
> >> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >> that also has the lower level FM-API access).  I think it is somewhat
> >> separate from the rest of this on basis it may well just be talking redfish
> >> to the FM and there are lots of tools for that sort of handling already.
> >> 
> > 
> > I would be interested in particpating in a BOF about this topic. I wonder what
> > happens when we have multiple switches with multiple FMs each on a separate BMC.
> > In this case, does it make more sense to have an owner of the global FM state 
> > be a user space application. Is this the job of the orchestrator?
> > 
> > The BMC based FM seems to have scalability issues, but will we hit them in
> > practice any time soon.
> 
> I had discussion recently and it looks like there are interesting points:
> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> very compute-intensive activity. So, potentially, FM on firmware side could be not
> capable to digest and executes all responsibilities without potential performance
> degradation.
> (2) However, if we have FM on host side, then there is security concerns because
> FM sees everything and all details of multiple hosts and subsystems.
> (3) Technically speaking, there is one potential capability that user-space FM daemon
> can run as on host side as on CXL switch side. I mean here that if we implement
> user-space FM daemon, then it could be used to execute FM functionality on CXL
> switch side (maybe????). :)
> 
> <skipped>
> 
> >>>>>   - Manage surprise removal of devices  
> >>>> 
> >>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>> what to do in the way of managing this.  Scream loudly?
> >>>> 
> >>> 
> >>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>> metadata correction and helping to manage application requests to not existing
> >>> entities.
> >> 
> >> Notifications for the host are likely to come via inband means - so type3 driver
> >> handling rather than related to FM.  As far as the host is concerned this is the
> >> same as case where there is no FM and someone ripped a device out.
> >> 
> >> There might indeed be meta data to manage, but doubt it will have anything to
> >> do with kernel.
> >> 
> > 
> > I've also had similar thoughts, I think the OS responds to notifications that
> > are generated in-band after changes to the state of the FM are made through 
> > OOB means.
> > 
> > I envision the host sends REDFISH requests to a switch BMC that has an FM
> > implementation. Once the changes are implemented by the FM it would show up
> > as changes to the PCIe hierarchy on a host, which is capable of responding to
> > such changes.
> > 
> 
> I think I am not completely follow your point. :) First of all, I assume that if host
> sends REDFISH request, then it will be expected the confirmation of request execution.
> It means for me that host needs to receive some packet that informs that request
> executed successfully or failed. It means that some subsystem or application requested
> this change and only after receiving the confirmation requested capabilities can be used.
> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> that some FM subsystem should be on the host side to receive confirmation/notification
> and to execute the real changes in PCIe hierarchy. Am missing something here?

Hopefully I have a point ;). I do expect a host to receive a response for a
given Redfish request, but the request/response would be OOB. I would go back
to the example of hot plugging a PCIe-based device. For example, if an NVMe
SSD is hot plugged, the OS is notified by HW that a new PCIe device has been
added. Going back to changes made by the FM: if the changes impact the CXL
hierarchy that is visible to a host, my expectation is that the host OS will
be informed of the changes requested of the FM when the host HW becomes aware
of the changes (the in-band change).
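
On the host side that in-band leg is just ordinary hotplug handling; as a
hedged sketch (assuming the FM-driven change surfaces as a normal PCI hotplug
uevent, and using pyudev purely for illustration):

# Hedged sketch: watch PCI add/remove uevents on the host. This is how
# the OS side would observe the result of an OOB FM reconfiguration,
# assuming it surfaces as a normal PCI hotplug event.
import pyudev

def watch_pci_hotplug() -> None:
    context = pyudev.Context()
    monitor = pyudev.Monitor.from_netlink(context)
    monitor.filter_by("pci")
    for device in iter(monitor.poll, None):
        print(f"{device.action}: {device.sys_path}")

if __name__ == "__main__":
    watch_pci_hotplug()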

> 
> Thanks,
> Slava.
> 


* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-09 22:10           ` Adam Manzanares
@ 2023-02-09 22:22             ` Viacheslav A.Dubeyko
  0 siblings, 0 replies; 13+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-02-09 22:22 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Jonathan Cameron, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko



> On Feb 9, 2023, at 2:10 PM, Adam Manzanares <a.manzanares@samsung.com> wrote:
> 
> On Wed, Feb 08, 2023 at 10:03:57AM -0800, Viacheslav A.Dubeyko wrote:
>> 
>> 
>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
>>> 
>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
>>>> On Wed, 1 Feb 2023 12:04:56 -0800
>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>>>> 
>>>>>> 
>> 
>> <skipped>
>> 
>>>>> 
>>>>> Most probably, we will have multiple FM implementations in firmware.
>>>>> Yes, FM on host could be important for debug and to verify correctness
>>>>> firmware-based implementations. But FM daemon on host could be important
>>>>> to receive notifications and react somehow on these events. Also, journalling
>>>>> of events/messages/events could be important responsibility of FM daemon
>>>>> on host. 
>>>> 
>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>>>> that also has the lower level FM-API access).  I think it is somewhat
>>>> separate from the rest of this on basis it may well just be talking redfish
>>>> to the FM and there are lots of tools for that sort of handling already.
>>>> 
>>> 
>>> I would be interested in particpating in a BOF about this topic. I wonder what
>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
>>> In this case, does it make more sense to have an owner of the global FM state 
>>> be a user space application. Is this the job of the orchestrator?
>>> 
>>> The BMC based FM seems to have scalability issues, but will we hit them in
>>> practice any time soon.
>> 
>> I had discussion recently and it looks like there are interesting points:
>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
>> very compute-intensive activity. So, potentially, FM on firmware side could be not
>> capable to digest and executes all responsibilities without potential performance
>> degradation.
>> (2) However, if we have FM on host side, then there is security concerns because
>> FM sees everything and all details of multiple hosts and subsystems.
>> (3) Technically speaking, there is one potential capability that user-space FM daemon
>> can run as on host side as on CXL switch side. I mean here that if we implement
>> user-space FM daemon, then it could be used to execute FM functionality on CXL
>> switch side (maybe????). :)
>> 
>> <skipped>
>> 
>>>>>>>  - Manage surprise removal of devices  
>>>>>> 
>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>>>> what to do in the way of managing this.  Scream loudly?
>>>>>> 
>>>>> 
>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
>>>>> metadata correction and helping to manage application requests to not existing
>>>>> entities.
>>>> 
>>>> Notifications for the host are likely to come via inband means - so type3 driver
>>>> handling rather than related to FM.  As far as the host is concerned this is the
>>>> same as case where there is no FM and someone ripped a device out.
>>>> 
>>>> There might indeed be meta data to manage, but doubt it will have anything to
>>>> do with kernel.
>>>> 
>>> 
>>> I've also had similar thoughts, I think the OS responds to notifications that
>>> are generated in-band after changes to the state of the FM are made through 
>>> OOB means.
>>> 
>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
>>> implementation. Once the changes are implemented by the FM it would show up
>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
>>> such changes.
>>> 
>> 
>> I think I am not completely follow your point. :) First of all, I assume that if host
>> sends REDFISH request, then it will be expected the confirmation of request execution.
>> It means for me that host needs to receive some packet that informs that request
>> executed successfully or failed. It means that some subsystem or application requested
>> this change and only after receiving the confirmation requested capabilities can be used.
>> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
>> that some FM subsystem should be on the host side to receive confirmation/notification
>> and to execute the real changes in PCIe hierarchy. Am missing something here?
> 
> Hopefully I have a point ;). I do expect a host to receive a response  for a
> given REDFISH request, but the request/response would be OOB. I would go back
> to the example of hot plugging in a PCIe based devices. For example if an nvme
> SSD is hot plugged, then the OS notified by HW that a new PCIe device has been
> added. Going back to changes made by the FM, if the changes impact the CXL
> hiearchy that is visible to a host, it is my expectation that the host OS will
> be informed of the changes requested of the FM when the host HW becomes aware
> of the changes (the in-band change).
> 

You are right if we are talking about hardware directly connected to the host. In that
case the CPU (or another hardware subsystem) can receive an interrupt and the kernel can
process the hardware change. But the FM can be remote and shared by multiple hosts.
In that case, we need some software subsystem on the host side that either polls or
expects to receive a network packet with a notification or confirmation of the change.
Or we need some hardware subsystem on every host that can interact with the remote FM
in the background and raise the interrupt locally in order to refresh kernel metadata.
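
A minimal sketch of such a host-side agent, assuming a hypothetical FM status
endpoint and using the standard /sys/bus/pci/rescan knob to make the kernel
re-enumerate once a change is reported (the endpoint and JSON field are made up):

# Hedged sketch: poll a remote FM for a fabric "generation" counter and
# trigger a PCI rescan when it changes, so the kernel refreshes its view.
import time
import requests

FM_STATUS_URL = "https://fm.example/status"   # placeholder endpoint

def rescan_pci() -> None:
    # Standard sysfs knob; needs root.
    with open("/sys/bus/pci/rescan", "w") as f:
        f.write("1")

def poll_fm(interval: float = 5.0) -> None:
    last = None
    while True:
        resp = requests.get(FM_STATUS_URL, timeout=10)
        resp.raise_for_status()
        generation = resp.json().get("FabricGeneration")
        if last is not None and generation != last:
            rescan_pci()
        last = generation
        time.sleep(interval)

if __name__ == "__main__":
    poll_fm()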

Thanks,
Slava.




* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-09 22:04             ` Viacheslav A.Dubeyko
@ 2023-02-10 12:32               ` Jonathan Cameron
  2023-02-17 18:31                 ` Viacheslav A.Dubeyko
  0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2023-02-10 12:32 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: Adam Manzanares, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko

On Thu, 9 Feb 2023 14:04:13 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:

> > On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > 
> > On Wed, 8 Feb 2023 10:03:57 -0800
> > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >   
> >>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
> >>> 
> >>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:    
> >>>> On Wed, 1 Feb 2023 12:04:56 -0800
> >>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >>>>   
> >>>>>>   
> >> 
> >> <skipped>
> >>   
> >>>>> 
> >>>>> Most probably, we will have multiple FM implementations in firmware.
> >>>>> Yes, FM on host could be important for debug and to verify correctness
> >>>>> firmware-based implementations. But FM daemon on host could be important
> >>>>> to receive notifications and react somehow on these events. Also, journalling
> >>>>> of events/messages/events could be important responsibility of FM daemon
> >>>>> on host.     
> >>>> 
> >>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >>>> that also has the lower level FM-API access).  I think it is somewhat
> >>>> separate from the rest of this on basis it may well just be talking redfish
> >>>> to the FM and there are lots of tools for that sort of handling already.
> >>>>   
> >>> 
> >>> I would be interested in particpating in a BOF about this topic. I wonder what
> >>> happens when we have multiple switches with multiple FMs each on a separate BMC.
> >>> In this case, does it make more sense to have an owner of the global FM state 
> >>> be a user space application. Is this the job of the orchestrator?  
> > 
> > This partly comes down to terminology. Ultimately there is an FM that is
> > responsible for the whole fabric (could be distributed software) and that
> > in turn will talk to a the various BMCs that then talk to the switches.
> > 
> > Depending on the setup it may not be necessary for any entity to see the
> > whole fabric.
> > 
> > Interesting point in general though.  I think it boils down to getting
> > layering in any software correct and that is easier done from outset.
> > 
> > I don't know whether the redfish stuff is flexible enough to cover this, but
> > if it is, I'd envision, the actual FM talking redfish to a bunch of sub-FMs
> > and in turn presenting redfish to the orchestrator.
> > 
> > Any of these components might run on separate machines, or in firmware on
> > some device, or indeed all run on one server that is acting as the FM and
> > a node in the orchestrator layer.
> >   
> >>> 
> >>> The BMC based FM seems to have scalability issues, but will we hit them in
> >>> practice any time soon.    
> > 
> > Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
> > we definitely will in the medium term.
> >   
> >> 
> >> I had discussion recently and it looks like there are interesting points:
> >> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> >> very compute-intensive activity. So, potentially, FM on firmware side could be not
> >> capable to digest and executes all responsibilities without potential performance
> >> degradation.  
> > 
> > There is firmware and their is firmware ;)  It's not uncommon for BMCs to be
> > significant devices in their own right and run Linux or other heavy weight OSes.
> >   
> >> (2) However, if we have FM on host side, then there is security concerns because
> >> FM sees everything and all details of multiple hosts and subsystems.  
> > 
> > Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> > at lest some implementations it will be running on a capable Linux machine.
> > In large fabrics that may be very capable indeed (basically a server dedicated to
> > this role).
> >   
> >> (3) Technically speaking, there is one potential capability that user-space FM daemon
> >> can run as on host side as on CXL switch side. I mean here that if we implement
> >> user-space FM daemon, then it could be used to execute FM functionality on CXL
> >> switch side (maybe????). :)  
> > 
> > Sure, anything could run anywhere.  We should draw up some 'reference' architectures
> > though to guide discussion down the line.  Mind you I think there are a lot of
> > steps along the way and starting point should be a simple PoC where all the FM
> > stuff is in linux userspace (other than comms).  That's easy enough to do.
> > If I get a quiet week or so I'll hammer out what we need on emulation side to
> > start playing with this.
> > 
> > Jonathan
> > 
> > 
> >   
> >> 
> >> <skipped>
> >>   
> >>>>>>>  - Manage surprise removal of devices      
> >>>>>> 
> >>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>>>> what to do in the way of managing this.  Scream loudly?
> >>>>>>   
> >>>>> 
> >>>>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>>>> metadata correction and helping to manage application requests to not existing
> >>>>> entities.    
> >>>> 
> >>>> Notifications for the host are likely to come via inband means - so type3 driver
> >>>> handling rather than related to FM.  As far as the host is concerned this is the
> >>>> same as case where there is no FM and someone ripped a device out.
> >>>> 
> >>>> There might indeed be meta data to manage, but doubt it will have anything to
> >>>> do with kernel.
> >>>>   
> >>> 
> >>> I've also had similar thoughts, I think the OS responds to notifications that
> >>> are generated in-band after changes to the state of the FM are made through 
> >>> OOB means.
> >>> 
> >>> I envision the host sends REDFISH requests to a switch BMC that has an FM
> >>> implementation. Once the changes are implemented by the FM it would show up
> >>> as changes to the PCIe hierarchy on a host, which is capable of responding to
> >>> such changes.
> >>>   
> >> 
> >> I think I am not completely follow your point. :) First of all, I assume that if host
> >> sends REDFISH request, then it will be expected the confirmation of request execution.
> >> It means for me that host needs to receive some packet that informs that request
> >> executed successfully or failed. It means that some subsystem or application requested
> >> this change and only after receiving the confirmation requested capabilities can be used.
> >> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> >> that some FM subsystem should be on the host side to receive confirmation/notification
> >> and to execute the real changes in PCIe hierarchy. Am missing something here?  
> > 
> > Another terminology issue I think.  FM from CXL side of things is an abstract thing
> > (potentially highly layered / distributed) that acts on instructions from an
> > orchestrator (also potentially highly distributed, one implementation is hosts
> > can be the orchestrator) and configures the fabric.
> > The downstream APIs to the switches and EPs are all in FM-API (CXL spec)
> > Upstream probably all Redfish.  What happens in between is impdef (though
> > obviously mapping to Redfish or FM-API as applicable may make it more
> > reuseable and flexible).
> > 
> > I think some diagrams of what is where will help.
> > I think we need (note I've always kept the controller hosts as normal hosts as well
> > as that includes the case where it never uses the Fabric - so BMC type cases as
> > a subset without needing to double the number of diagrams).
> > 
> > 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
> >   to a single switch - interfaces options include switch CCI MB, mctp of PCI VDM,
> >   mctp over say i2c.
> > 
> > 2) Diagram of same as above, with a multiple head device all connected to one host.
> > 
> > 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
> >   one of which is responsible for  fabric management.   FM in that manager host
> >   and orchestrator) - agents on other hosts able to send requests for services to that host.
> > 
> > 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
> >   Some other hosts that don't have any fabric control.
> >   Distributed FM across the controlling hosts.
> > 
> > 5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
> >   orchestrator, that then talks to the FM.
> > 
> > 6) 4, but push some management entities down into switches (from architecture point of
> >   view this is no different from layered case with a separate BMC per switch - there
> >   is still either a distributed FM or a layered FM, which the orchestrator talks to.)
> > 
> > Can mess with exactly distribution of who does what across the various layers.
> > 
> > I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> > but will take a little while, hence text descriptions in the meantime.
> > 
> > I come back to my personal view though - which is don't worry too much at this early
> > stage, beyond making sure we have some layering in code so that we can distribute
> > it across a distributed or layered architecture later!   
> >   
> 
> I had slightly more simplified image in my mind. :) We definitely need to have diagrams
> to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
> Any suggestion?

Ascii art :)  To have a broad discussion it needs to be on the mailing list, and that
is effectively the only option.

> 
> Thanks,
> Slava.
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-10 12:32               ` Jonathan Cameron
@ 2023-02-17 18:31                 ` Viacheslav A.Dubeyko
  2023-02-20 11:59                   ` Jonathan Cameron
  0 siblings, 1 reply; 13+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-02-17 18:31 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Adam Manzanares, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko



> On Feb 10, 2023, at 4:32 AM, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> On Thu, 9 Feb 2023 14:04:13 -0800
> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> 
>>> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>>> 
>>> On Wed, 8 Feb 2023 10:03:57 -0800
>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>>> 
>>>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
>>>>> 
>>>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:    
>>>>>> On Wed, 1 Feb 2023 12:04:56 -0800
>>>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
>>>>>> 
>>>>>>>> 
>>>> 
>>>> <skipped>
>>>> 
>>>>>>> 
>>>>>>> Most probably, we will have multiple FM implementations in firmware.
>>>>>>> Yes, FM on host could be important for debug and to verify correctness
>>>>>>> firmware-based implementations. But FM daemon on host could be important
>>>>>>> to receive notifications and react somehow on these events. Also, journalling
>>>>>>> of events/messages/events could be important responsibility of FM daemon
>>>>>>> on host.     
>>>>>> 
>>>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
>>>>>> that also has the lower level FM-API access).  I think it is somewhat
>>>>>> separate from the rest of this on basis it may well just be talking redfish
>>>>>> to the FM and there are lots of tools for that sort of handling already.
>>>>>> 
>>>>> 
>>>>> I would be interested in participating in a BOF about this topic. I wonder what
>>>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
>>>>> In this case, does it make more sense to have an owner of the global FM state 
>>>>> be a user space application. Is this the job of the orchestrator?  
>>> 
>>> This partly comes down to terminology. Ultimately there is an FM that is
>>> responsible for the whole fabric (could be distributed software) and that
>>> in turn will talk to the various BMCs that then talk to the switches.
>>> 
>>> Depending on the setup it may not be necessary for any entity to see the
>>> whole fabric.
>>> 
>>> Interesting point in general though.  I think it boils down to getting
>>> layering in any software correct and that is easier done from outset.
>>> 
>>> I don't know whether the redfish stuff is flexible enough to cover this, but
>>> if it is, I'd envision, the actual FM talking redfish to a bunch of sub-FMs
>>> and in turn presenting redfish to the orchestrator.
>>> 
>>> Any of these components might run on separate machines, or in firmware on
>>> some device, or indeed all run on one server that is acting as the FM and
>>> a node in the orchestrator layer.
>>> 
>>>>> 
>>>>> The BMC based FM seems to have scalability issues, but will we hit them in
>>>>> practice any time soon.    
>>> 
>>> Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
>>> we definitely will in the medium term.
>>> 
>>>> 
>>>> I had discussion recently and it looks like there are interesting points:
>>>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
>>>> very compute-intensive activity. So, potentially, FM on firmware side could be not
>>>> capable to digest and executes all responsibilities without potential performance
>>>> degradation.  
>>> 
>>> There is firmware and there is firmware ;)  It's not uncommon for BMCs to be
>>> significant devices in their own right and run Linux or other heavy weight OSes.
>>> 
>>>> (2) However, if we have FM on host side, then there is security concerns because
>>>> FM sees everything and all details of multiple hosts and subsystems.  
>>> 
>>> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
>>> at least some implementations it will be running on a capable Linux machine.
>>> In large fabrics that may be very capable indeed (basically a server dedicated to
>>> this role).
>>> 
>>>> (3) Technically speaking, there is one potential capability that user-space FM daemon
>>>> can run as on host side as on CXL switch side. I mean here that if we implement
>>>> user-space FM daemon, then it could be used to execute FM functionality on CXL
>>>> switch side (maybe????). :)  
>>> 
>>> Sure, anything could run anywhere.  We should draw up some 'reference' architectures
>>> though to guide discussion down the line.  Mind you I think there are a lot of
>>> steps along the way and starting point should be a simple PoC where all the FM
>>> stuff is in linux userspace (other than comms).  That's easy enough to do.
>>> If I get a quiet week or so I'll hammer out what we need on emulation side to
>>> start playing with this.
>>> 
>>> Jonathan
>>> 
>>> 
>>> 
>>>> 
>>>> <skipped>
>>>> 
>>>>>>>>> - Manage surprise removal of devices      
>>>>>>>> 
>>>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
>>>>>>>> what to do in the way of managing this.  Scream loudly?
>>>>>>>> 
>>>>>>> 
>>>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
>>>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
>>>>>>> metadata correction and helping to manage application requests to not existing
>>>>>>> entities.    
>>>>>> 
>>>>>> Notifications for the host are likely to come via inband means - so type3 driver
>>>>>> handling rather than related to FM.  As far as the host is concerned this is the
>>>>>> same as case where there is no FM and someone ripped a device out.
>>>>>> 
>>>>>> There might indeed be meta data to manage, but doubt it will have anything to
>>>>>> do with kernel.
>>>>>> 
>>>>> 
>>>>> I've also had similar thoughts, I think the OS responds to notifications that
>>>>> are generated in-band after changes to the state of the FM are made through 
>>>>> OOB means.
>>>>> 
>>>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
>>>>> implementation. Once the changes are implemented by the FM it would show up
>>>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
>>>>> such changes.
>>>>> 
>>>> 
>>>> I think I am not completely follow your point. :) First of all, I assume that if host
>>>> sends REDFISH request, then it will be expected the confirmation of request execution.
>>>> It means for me that host needs to receive some packet that informs that request
>>>> executed successfully or failed. It means that some subsystem or application requested
>>>> this change and only after receiving the confirmation requested capabilities can be used.
>>>> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
>>>> that some FM subsystem should be on the host side to receive confirmation/notification
>>>> and to execute the real changes in PCIe hierarchy. Am missing something here?  
>>> 
>>> Another terminology issue I think.  FM from CXL side of things is an abstract thing
>>> (potentially highly layered / distributed) that acts on instructions from an
>>> orchestrator (also potentially highly distributed, one implementation is hosts
>>> can be the orchestrator) and configures the fabric.
>>> The downstream APIs to the switches and EPs are all in FM-API (CXL spec)
>>> Upstream probably all Redfish.  What happens in between is impdef (though
>>> obviously mapping to Redfish or FM-API as applicable may make it more
>>> reusable and flexible).
>>> 
>>> I think some diagrams of what is where will help.
>>> I think we need (note I've always kept the controller hosts as normal hosts as well
>>> as that includes the case where it never uses the Fabric - so BMC type cases as
>>> a subset without needing to double the number of diagrams).
>>> 
>>> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
>>>  to a single switch - interface options include switch CCI MB, mctp over PCI VDM,
>>>  mctp over say i2c.
>>> 
>>> 2) Diagram of same as above, with a multiple head device all connected to one host.
>>> 
>>> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
>>>  one of which is responsible for  fabric management.   FM in that manager host
>>>  and orchestrator) - agents on other hosts able to send requests for services to that host.
>>> 
>>> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
>>>  Some other hosts that don't have any fabric control.
>>>  Distributed FM across the controlling hosts.
>>> 
>>> 5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
>>>  orchestrator, that then talks to the FM.
>>> 
>>> 6) 4, but push some management entities down into switches (from architecture point of
>>>  view this is no different from layered case with a separate BMC per switch - there
>>>  is still either a distributed FM or a layered FM, which the orchestrator talks to.)
>>> 
>>> Can mess with exactly distribution of who does what across the various layers.
>>> 
>>> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
>>> but will take a little while, hence text descriptions in the meantime.
>>> 
>>> I come back to my personal view though - which is don't worry too much at this early
>>> stage, beyond making sure we have some layering in code so that we can distribute
>>> it across a distributed or layered architecture later!   
>>> 
>> 
>> I had slightly more simplified image in my mind. :) We definitely need to have diagrams
>> to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
>> Any suggestion?
> 
> Ascii art :)  To have a broad discussion it needs to be on the mailing list, and that
> is effectively the only option.
> 

I tried to prepare a diagram in ascii art. :) It looks pretty terrible in email:

----------------------------         ------------------
|  ---------       ------  |         |                |
|  | Agent | <---> | FM |  |         |                |
|  ---------       ------  |<------->|   CXL switch   |
|            Host          |         |                |
|                          |         |                |
----------------------------         —————————

I think we need to use some online resource anyway. Adam and I are discussing what we
can do here.
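
To make the boxes a bit more concrete, below is a minimal Rust sketch of that host-side split
(Rust only because it was floated earlier in the thread as a candidate implementation language).
The AgentRequest/FmApiCommand names and the send_to_switch() stand-in are purely illustrative
assumptions, not anything defined by the CXL specification or by an existing tool.

// Hypothetical sketch of the Host box above: an Agent hands requests to an
// FM daemon, which translates them into FM-API commands for the CXL switch.
// All names here are invented for illustration.

enum AgentRequest {
    QuerySwitch,
    BindVppb { vppb: u8, physical_port: u8 },
}

#[derive(Debug)]
enum FmApiCommand {
    IdentifySwitchDevice,
    BindVppb { vppb: u8, physical_port: u8 },
}

// The FM daemon's job in this picture: map agent-level requests onto FM-API
// commands and send them over whatever transport reaches the switch
// (Mailbox CCI, MCTP over PCIe VDM or i2c, ...), faked here as a function call.
fn fm_daemon(req: AgentRequest) -> String {
    let cmd = match req {
        AgentRequest::QuerySwitch => FmApiCommand::IdentifySwitchDevice,
        AgentRequest::BindVppb { vppb, physical_port } => {
            FmApiCommand::BindVppb { vppb, physical_port }
        }
    };
    send_to_switch(cmd)
}

// Stand-in for the "CXL switch" box; a real FM would frame the command per
// the chosen transport and wait for the CCI completion.
fn send_to_switch(cmd: FmApiCommand) -> String {
    format!("switch acked {:?}", cmd)
}

fn main() {
    println!("{}", fm_daemon(AgentRequest::QuerySwitch));
    println!("{}", fm_daemon(AgentRequest::BindVppb { vppb: 0, physical_port: 2 }));
}

The only point of the sketch is the layering: everything above send_to_switch() can live in
user space, with the transport details hidden behind that one boundary.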

You introduced the Orchestrator entity. I realize that I do not completely follow the responsibility
of this subsystem. Do you mean some central point of management for multiple FM instances?
Something like a router that has a knowledge base and can redirect requests to the proper FM
instance. Am I correct? It sounds to me that the orchestrator needs to implement some
sub-API of the FM. Or maybe it only needs to parse REDFISH packets, for example, and
redirect them.

Thanks,
Slava.
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
  2023-02-17 18:31                 ` Viacheslav A.Dubeyko
@ 2023-02-20 11:59                   ` Jonathan Cameron
  0 siblings, 0 replies; 13+ messages in thread
From: Jonathan Cameron @ 2023-02-20 11:59 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: Adam Manzanares, lsf-pc, linux-mm, linux-cxl, Dan Williams,
	Cong Wang, Viacheslav Dubeyko

On Fri, 17 Feb 2023 10:31:15 -0800
"Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:

> > On Feb 10, 2023, at 4:32 AM, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > 
> > On Thu, 9 Feb 2023 14:04:13 -0800
> > "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >   
> >>> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >>> 
> >>> On Wed, 8 Feb 2023 10:03:57 -0800
> >>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >>>   
> >>>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares <a.manzanares@samsung.com> wrote:
> >>>>> 
> >>>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:      
> >>>>>> On Wed, 1 Feb 2023 12:04:56 -0800
> >>>>>> "Viacheslav A.Dubeyko" <viacheslav.dubeyko@bytedance.com> wrote:
> >>>>>>   
> >>>>>>>>   
> >>>> 
> >>>> <skipped>
> >>>>   
> >>>>>>> 
> >>>>>>> Most probably, we will have multiple FM implementations in firmware.
> >>>>>>> Yes, FM on host could be important for debug and to verify correctness
> >>>>>>> firmware-based implementations. But FM daemon on host could be important
> >>>>>>> to receive notifications and react somehow on these events. Also, journalling
> >>>>>>> of events/messages/events could be important responsibility of FM daemon
> >>>>>>> on host.       
> >>>>>> 
> >>>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >>>>>> that also has the lower level FM-API access).  I think it is somewhat
> >>>>>> separate from the rest of this on basis it may well just be talking redfish
> >>>>>> to the FM and there are lots of tools for that sort of handling already.
> >>>>>>   
> >>>>> 
> >>>>> I would be interested in participating in a BOF about this topic. I wonder what
> >>>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
> >>>>> In this case, does it make more sense to have an owner of the global FM state 
> >>>>> be a user space application. Is this the job of the orchestrator?    
> >>> 
> >>> This partly comes down to terminology. Ultimately there is an FM that is
> >>> responsible for the whole fabric (could be distributed software) and that
> >>> in turn will talk to the various BMCs that then talk to the switches.
> >>> 
> >>> Depending on the setup it may not be necessary for any entity to see the
> >>> whole fabric.
> >>> 
> >>> Interesting point in general though.  I think it boils down to getting
> >>> layering in any software correct and that is easier done from outset.
> >>> 
> >>> I don't know whether the redfish stuff is flexible enough to cover this, but
> >>> if it is, I'd envision, the actual FM talking redfish to a bunch of sub-FMs
> >>> and in turn presenting redfish to the orchestrator.
> >>> 
> >>> Any of these components might run on separate machines, or in firmware on
> >>> some device, or indeed all run on one server that is acting as the FM and
> >>> a node in the orchestrator layer.
> >>>   
> >>>>> 
> >>>>> The BMC based FM seems to have scalability issues, but will we hit them in
> >>>>> practice any time soon.      
> >>> 
> >>> Who knows ;)  If anyone builds the large scale fabric stuff in CXL 3.0 then
> >>> we definitely will in the medium term.
> >>>   
> >>>> 
> >>>> I had discussion recently and it looks like there are interesting points:
> >>>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> >>>> very compute-intensive activity. So, potentially, FM on firmware side could be not
> >>>> capable to digest and executes all responsibilities without potential performance
> >>>> degradation.    
> >>> 
> >>> There is firmware and there is firmware ;)  It's not uncommon for BMCs to be
> >>> significant devices in their own right and run Linux or other heavy weight OSes.
> >>>   
> >>>> (2) However, if we have FM on host side, then there is security concerns because
> >>>> FM sees everything and all details of multiple hosts and subsystems.    
> >>> 
> >>> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> >>> at least some implementations it will be running on a capable Linux machine.
> >>> In large fabrics that may be very capable indeed (basically a server dedicated to
> >>> this role).
> >>>   
> >>>> (3) Technically speaking, there is one potential capability that user-space FM daemon
> >>>> can run as on host side as on CXL switch side. I mean here that if we implement
> >>>> user-space FM daemon, then it could be used to execute FM functionality on CXL
> >>>> switch side (maybe????). :)    
> >>> 
> >>> Sure, anything could run anywhere.  We should draw up some 'reference' architectures
> >>> though to guide discussion down the line.  Mind you I think there are a lot of
> >>> steps along the way and starting point should be a simple PoC where all the FM
> >>> stuff is in linux userspace (other than comms).  That's easy enough to do.
> >>> If I get a quiet week or so I'll hammer out what we need on emulation side to
> >>> start playing with this.
> >>> 
> >>> Jonathan
> >>> 
> >>> 
> >>>   
> >>>> 
> >>>> <skipped>
> >>>>   
> >>>>>>>>> - Manage surprise removal of devices        
> >>>>>>>> 
> >>>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>>>>>> what to do in the way of managing this.  Scream loudly?
> >>>>>>>>   
> >>>>>>> 
> >>>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>>>>>> metadata correction and helping to manage application requests to not existing
> >>>>>>> entities.      
> >>>>>> 
> >>>>>> Notifications for the host are likely to come via inband means - so type3 driver
> >>>>>> handling rather than related to FM.  As far as the host is concerned this is the
> >>>>>> same as case where there is no FM and someone ripped a device out.
> >>>>>> 
> >>>>>> There might indeed be meta data to manage, but doubt it will have anything to
> >>>>>> do with kernel.
> >>>>>>   
> >>>>> 
> >>>>> I've also had similar thoughts, I think the OS responds to notifications that
> >>>>> are generated in-band after changes to the state of the FM are made through 
> >>>>> OOB means.
> >>>>> 
> >>>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
> >>>>> implementation. Once the changes are implemented by the FM it would show up
> >>>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
> >>>>> such changes.
> >>>>>   
> >>>> 
> >>>> I think I am not completely follow your point. :) First of all, I assume that if host
> >>>> sends REDFISH request, then it will be expected the confirmation of request execution.
> >>>> It means for me that host needs to receive some packet that informs that request
> >>>> executed successfully or failed. It means that some subsystem or application requested
> >>>> this change and only after receiving the confirmation requested capabilities can be used.
> >>>> And if FM is on CXL switch side, then how FM will show up the changes? It sounds for me
> >>>> that some FM subsystem should be on the host side to receive confirmation/notification
> >>>> and to execute the real changes in PCIe hierarchy. Am missing something here?    
> >>> 
> >>> Another terminology issue I think.  FM from CXL side of things is an abstract thing
> >>> (potentially highly layered / distributed) that acts on instructions from an
> >>> orchestrator (also potentially highly distributed, one implementation is hosts
> >>> can be the orchestrator) and configures the fabric.
> >>> The downstream APIs to the switches and EPs are all in FM-API (CXL spec)
> >>> Upstream probably all Redfish.  What happens in between is impdef (though
> >>> obviously mapping to Redfish or FM-API as applicable may make it more
> >>> reusable and flexible).
> >>> 
> >>> I think some diagrams of what is where will help.
> >>> I think we need (note I've always kept the controller hosts as normal hosts as well
> >>> as that includes the case where it never uses the Fabric - so BMC type cases as
> >>> a subset without needing to double the number of diagrams).
> >>> 
> >>> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
> >>>  to a single switch - interface options include switch CCI MB, mctp over PCI VDM,
> >>>  mctp over say i2c.
> >>> 
> >>> 2) Diagram of same as above, with a multiple head device all connected to one host.
> >>> 
> >>> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
> >>>  one of which is responsible for  fabric management.   FM in that manager host
> >>>  and orchestrator) - agents on other hosts able to send requests for services to that host.
> >>> 
> >>> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
> >>>  Some other hosts that don't have any fabric control.
> >>>  Distributed FM across the controlling hosts.
> >>> 
> >>> 5) Diagram of 4 but with layered FM and separate Orchestrator.  Hosts all talk to the
> >>>  orchestrator, that then talks to the FM.
> >>> 
> >>> 6) 4, but push some management entities down into switches (from architecture point of
> >>>  view this is no different from layered case with a separate BMC per switch - there
> >>>  is still either a distributed FM or a layered FM, which the orchestrator talks to.)
> >>> 
> >>> Can mess with exactly distribution of who does what across the various layers.
> >>> 
> >>> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> >>> but will take a little while, hence text descriptions in the meantime.
> >>> 
> >>> I come back to my personal view though - which is don't worry too much at this early
> >>> stage, beyond making sure we have some layering in code so that we can distribute
> >>> it across a distributed or layered architecture later!   
> >>>   
> >> 
> >> I had slightly more simplified image in my mind. :) We definitely need to have diagrams
> >> to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
> >> Any suggestion?  
> > 
> > Ascii art :)  To have a broad discussion it needs to be on the mailing list, and that
> > is effectively the only option.
> >   
> 
> I tried to prepare a diagram in ascii art. :) It looks pretty terrible in email:
> 
> ----------------------------         ------------------
> |  ---------       ------  |         |                |
> |  | Agent | <---> | FM |  |         |                |
> |  ---------       ------  |<------->|   CXL switch   |
> |            Host          |         |                |
> |                          |         |                |
> ----------------------------         —————————
Other than the wrong line type on the right it looks fine to me ;)

> 
> I think we need to use some online resource anyway. Adam and I are discussing what we
> can do here.
> 
> You introduced the Orchestrator entity. I realize that I do not completely follow the responsibility
> of this subsystem. Do you mean some central point of management for multiple FM instances?

Absolutely - whether its role is actually separate from the FM or not is an implementation
detail, but the assumption is that someone is placing the VMs etc. that are using the CXL memory,
and only that entity has the knowledge of what memory to assign to which host in order to
provide that memory to the VMs.

> Something like a router that has a knowledge base and can redirect requests to the proper FM
> instance. Am I correct?

More than that.  The orchestrator would get a 'give me a VM with X normal DRAM and X CXL DRAM'
request; it would figure out where to put that VM across a set of systems and issue the commands
to the relevant FMs to 'make it so'.  So that's the entity that would query all the FMs
to understand what resources it is managing and then tell them what to do (possibly
via multiple layers of abstraction, sub-orchestrators, etc.).
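
As a rough illustration of that query-then-command flow, here is a small Rust sketch using a
trivial first-fit placement. The VmRequest/HostInventory types and the printed "FM(...)"
command are invented stand-ins for whatever Redfish / FM-API exchanges a real orchestrator
would use; this is a sketch of the shape of the flow, not a proposed implementation.

// Hypothetical orchestrator sketch: pick a host that can satisfy a VM request
// and tell that host's FM to provide the CXL capacity. First-fit placement and
// invented types; a real orchestrator would talk Redfish/FM-API instead.

struct VmRequest {
    normal_dram_gb: u64,
    cxl_dram_gb: u64,
}

struct HostInventory {
    name: &'static str,
    free_dram_gb: u64,
    free_cxl_gb: u64, // as previously reported by the FM(s) behind this host
}

struct Orchestrator {
    hosts: Vec<HostInventory>,
}

impl Orchestrator {
    // Query-then-command: find a host with room, then issue the allocation.
    fn place_vm(&mut self, req: &VmRequest) -> Option<&'static str> {
        let host = self.hosts.iter_mut().find(|h| {
            h.free_dram_gb >= req.normal_dram_gb && h.free_cxl_gb >= req.cxl_dram_gb
        })?;
        host.free_dram_gb -= req.normal_dram_gb;
        host.free_cxl_gb -= req.cxl_dram_gb;
        // Stand-in for "issue the commands to the relevant FMs to 'make it so'".
        println!("FM({}): assign {} GB of CXL capacity to this host",
                 host.name, req.cxl_dram_gb);
        Some(host.name)
    }
}

fn main() {
    let mut orch = Orchestrator {
        hosts: vec![
            HostInventory { name: "host-a", free_dram_gb: 64, free_cxl_gb: 0 },
            HostInventory { name: "host-b", free_dram_gb: 256, free_cxl_gb: 512 },
        ],
    };
    let req = VmRequest { normal_dram_gb: 128, cxl_dram_gb: 256 };
    match orch.place_vm(&req) {
        Some(host) => println!("VM placed on {}", host),
        None => println!("no host can satisfy the request"),
    }
}

The same shape holds whether the "FM" being commanded is a single switch CCI or a whole
layered/distributed FM, which is why getting the layering right matters more than the exact
placement at this stage.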


> It sounds to me that the orchestrator needs to implement some
> sub-API of the FM. Or maybe it only needs to parse REDFISH packets, for example, and
> redirect them.

I'd expect individual hosts to mostly do what they are told to do, or maybe
ask nicely for more resources for a particular VM or application.  The hosts shouldn't
be responsible for allocating those resources; they should just be told where they
are.  That stuff might be in Redfish or similar, but it's way above the level of
anything CXL specific.
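
For the host side of that, a minimal sketch of what "ask nicely and be told where the memory
is" might look like; the GrowRequest/GrowResponse shapes are assumptions made up for
illustration and are not taken from any Redfish schema.

// Hypothetical host-agent sketch: the host does not allocate fabric resources
// itself; it only asks the orchestrator and is told where the memory landed.
// Request/response shapes are invented for illustration.

struct GrowRequest {
    vm: &'static str,
    extra_cxl_gb: u64,
}

enum GrowResponse {
    // e.g. "new capacity will appear behind this region"
    Granted { region: &'static str, size_gb: u64 },
    Denied { reason: &'static str },
}

// Stand-in for the orchestrator endpoint the host would actually call
// (Redfish or similar, well above anything CXL specific).
fn ask_orchestrator(req: &GrowRequest) -> GrowResponse {
    if req.extra_cxl_gb <= 128 {
        GrowResponse::Granted { region: "region0", size_gb: req.extra_cxl_gb }
    } else {
        GrowResponse::Denied { reason: "no free capacity" }
    }
}

fn main() {
    let req = GrowRequest { vm: "vm-17", extra_cxl_gb: 64 };
    match ask_orchestrator(&req) {
        GrowResponse::Granted { region, size_gb } => {
            // The host then just consumes whatever shows up in-band (hotplug /
            // dynamic capacity events handled by the kernel drivers).
            println!("{}: use {} GB from {}", req.vm, size_gb, region);
        }
        GrowResponse::Denied { reason } => {
            println!("{}: request denied: {}", req.vm, reason);
        }
    }
}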

Jonathan

> 
> Thanks,
> Slava.
>  


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-02-20 11:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-30 19:11 [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture Viacheslav A.Dubeyko
2023-01-31 17:41 ` Jonathan Cameron
2023-02-01 20:04   ` [External] " Viacheslav A.Dubeyko
2023-02-02  9:54     ` Jonathan Cameron
2023-02-08 16:38       ` Adam Manzanares
2023-02-08 18:03         ` Viacheslav A.Dubeyko
2023-02-09 11:05           ` Jonathan Cameron
2023-02-09 22:04             ` Viacheslav A.Dubeyko
2023-02-10 12:32               ` Jonathan Cameron
2023-02-17 18:31                 ` Viacheslav A.Dubeyko
2023-02-20 11:59                   ` Jonathan Cameron
2023-02-09 22:10           ` Adam Manzanares
2023-02-09 22:22             ` Viacheslav A.Dubeyko
