* Re: phys_port_id in switchdev mode?
       [not found] <20180828200539.1c0fe607@cakuba.netronome.com>
@ 2018-08-28 18:43 ` Jakub Kicinski
  2018-08-31 20:13   ` Marcelo Ricardo Leitner
  2018-09-03  9:40 ` Or Gerlitz
  1 sibling, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2018-08-28 18:43 UTC (permalink / raw)
  To: Florian Fainelli, Or Gerlitz, Simon Horman, Andy Gospodarek,
	mchan, Jiri Pirko, Alexander Duyck, Frederick Botha
  Cc: nick viljoen, netdev

Ugh, CC: netdev..

On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:
> Hi!
> 
> I wonder if we can use phys_port_id in switchdev to group together
> interfaces of a single PCI PF?  Here is the problem:
> 
> With a mix of PF and VF interfaces it gets increasingly difficult to
> figure out which one corresponds to which PF.  We can identify which
> *representor* is which, by means of phys_port_name and devlink
> flavours.  But if the actual VF/PF interfaces are also present on the
> same host, it gets confusing when one tries to identify the PF they
> came from.  Generally one has to resort to matching between PCI DBDF of
> the PF and VFs or read relevant info out of ethtool -i.
> 
> In a multi-host scenario this is particularly painful, as there seems to
> be no immediately obvious way to match PCI interface ID of a card (0,
> 1, 2, 3, 4...) to the DBDF we have connected.
> 
> Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
> from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in a somewhat
> random manner, which means we have to provide those for all devices with
> link to the PF (all reprs).  And we have to link them (a) because it's
> right (tm) and (b) to get correct naming.  The only reliable way to make
> user space (libvirt) choose the repr it should run the NDOs on (which is
> IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
> VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
> not the external/Ethernet port when in switchdev mode.  User space
> should understand phys_port_id in this context, given it was originally
> introduced for matching VFs to ports.
> 
> I hope this explanation makes sense, and is correct.  Please point out
> errors in my understanding, any comments would be appreciated! :)
> 
> Jiri?  Or?  Others?
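
For readers unfamiliar with the DBDF matching mentioned above, here is a
minimal Python sketch of it. It assumes the usual sysfs layout, in which a
VF's directory under /sys/bus/pci/devices/ carries a physfn symlink to its
PF and each PCI function lists its netdevs under net/. The helper names
pf_of_vf and netdevs_of are hypothetical, and the sysfs_pci parameter is
there only so the code can be pointed at a test tree.

```python
import os

SYSFS_PCI = "/sys/bus/pci/devices"


def pf_of_vf(vf_dbdf, sysfs_pci=SYSFS_PCI):
    """Return the DBDF of the PF owning this VF, or None if it has no physfn."""
    physfn = os.path.join(sysfs_pci, vf_dbdf, "physfn")
    if not os.path.islink(physfn):
        return None
    # The symlink target's last path component is the PF's DBDF.
    return os.path.basename(os.path.realpath(physfn))


def netdevs_of(dbdf, sysfs_pci=SYSFS_PCI):
    """Return the sorted netdev names registered under one PCI function."""
    try:
        return sorted(os.listdir(os.path.join(sysfs_pci, dbdf, "net")))
    except FileNotFoundError:
        return []
```

Calling netdevs_of() on the result of pf_of_vf() yields every netdev linked
to the PF. That list does not say which entry is the right one to run the
NDOs on, which is the ambiguity the message describes.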


* Re: phys_port_id in switchdev mode?
  2018-08-28 18:43 ` phys_port_id in switchdev mode? Jakub Kicinski
@ 2018-08-31 20:13   ` Marcelo Ricardo Leitner
  2018-09-01 11:34     ` Jakub Kicinski
  2018-09-03  9:43     ` Or Gerlitz
  0 siblings, 2 replies; 11+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-08-31 20:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Fainelli, Or Gerlitz, Simon Horman, Andy Gospodarek,
	mchan, Jiri Pirko, Alexander Duyck, Frederick Botha,
	nick viljoen, netdev

On Tue, Aug 28, 2018 at 08:43:51PM +0200, Jakub Kicinski wrote:
> Ugh, CC: netdev..
> 
> On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:
> > Hi!
> > 
> > I wonder if we can use phys_port_id in switchdev to group together
> > interfaces of a single PCI PF?  Here is the problem:

On Mellanox cards, this is already possible via phys_switch_id, as
each PF has its own phys_switch_id. So all VFs with a given
phys_switch_id belong to the PF with that same phys_switch_id.
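
As a rough illustration of the grouping described above, the following
Python sketch buckets netdevs by the phys_switch_id attribute under
/sys/class/net/. It is only meaningful under the assumption stated here,
that each PF reports its own phys_switch_id. The function name
group_by_switch_id and the sysfs_net parameter are hypothetical.

```python
import os
from collections import defaultdict


def group_by_switch_id(sysfs_net="/sys/class/net"):
    """Map each phys_switch_id to the sorted netdev names reporting it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, name, "phys_switch_id")
        try:
            with open(path) as f:
                switch_id = f.read().strip()
        except OSError:
            # Attribute missing, or the driver does not support it.
            continue
        if switch_id:
            groups[switch_id].append(name)
    return dict(groups)
```

If several PFs share one phys_switch_id, all of their netdevs land in the
same bucket and this grouping no longer identifies the PF.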

I understand this is a vendor-specific design, but if you have the
same phys_switch_id across PFs, does it really matter on which PF the
VF was created on?

  Marcelo


* Re: phys_port_id in switchdev mode?
  2018-08-31 20:13   ` Marcelo Ricardo Leitner
@ 2018-09-01 11:34     ` Jakub Kicinski
  2018-09-03 13:55       ` Marcelo Ricardo Leitner
  2018-09-03  9:43     ` Or Gerlitz
  1 sibling, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2018-09-01 11:34 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Florian Fainelli, Or Gerlitz, Simon Horman, Andy Gospodarek,
	mchan, Jiri Pirko, Alexander Duyck, Frederick Botha,
	nick viljoen, netdev

On Fri, 31 Aug 2018 17:13:22 -0300, Marcelo Ricardo Leitner wrote:
> On Tue, Aug 28, 2018 at 08:43:51PM +0200, Jakub Kicinski wrote:
> > Ugh, CC: netdev..
> > 
> > On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:  
> > > Hi!
> > > 
> > > I wonder if we can use phys_port_id in switchdev to group together
> > > interfaces of a single PCI PF?  Here is the problem:  
> 
> On Mellanox cards, this is already possible via phys_switch_id, as
> each PF has its own phys_switch_id. So all VFs with a given
> phys_switch_id belong to the PF with that same phys_switch_id.

You mean ConnectX-4 and later; ConnectX-3 also shares a PF between ports.
But you're right, in simpler designs there is usually a 1:1 relation
between an external networking port, a PCIe PF and an eswitch.  Which
does not mean it's correct to conflate the ports.

> I understand this is a vendor-specific design

It's hardly a strange design :)  If the card is a true switch,
associating VFs on the PCIe side with an Ethernet port makes no sense.

> but if you have the same phys_switch_id across PFs, does it really
> matter on which PF the VF was created on?

It does, because one can have multiple PFs, potentially connected to
different hosts.  And legacy NDOs, for example, must target the right
one.

So for one thing it seems like the right thing to do to associate VFs
with PFs, not external ports.

The other thing is that if you look at your Mellanox card in switchdev
mode, you'll notice that systemd/udev is not able to name the
representor interfaces properly without custom rules, because they are
not linked to the PCIe PF.  They are not linked because VF reprs don't
provide legacy SR-IOV NDOs, so we prevent libvirt from poking at them.
phys_port_id should steer libvirt away from VF representors, so we can
get proper naming.  (And having legacy NDOs implemented everywhere is a
no-go, because ip link output grows quadratically.)
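
To make the quadratic growth concrete, here is a small Python sketch of the
arithmetic. It rests on an assumption not spelled out above: every netdev
that implements the legacy SR-IOV NDOs lists one entry per VF in ip link
output. The function name vf_info_lines is hypothetical.

```python
def vf_info_lines(num_vfs, ndos_on_all_reprs):
    """Count per-VF entries printed by `ip link show` for a single PF.

    Assumption: each netdev implementing the legacy NDOs prints one
    entry per VF of that PF.
    """
    reporting_netdevs = 1  # the PF netdev itself
    if ndos_on_all_reprs:
        reporting_netdevs += num_vfs  # plus one VF representor per VF
    return reporting_netdevs * num_vfs
```

Under that assumption, 64 VFs give 64 entries when only the PF netdev
carries the NDOs, and 64 * 65 = 4160 entries when every VF representor
carries them too.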


* Re: phys_port_id in switchdev mode?
       [not found] <20180828200539.1c0fe607@cakuba.netronome.com>
  2018-08-28 18:43 ` phys_port_id in switchdev mode? Jakub Kicinski
@ 2018-09-03  9:40 ` Or Gerlitz
  2018-09-04 10:20   ` Jakub Kicinski
  1 sibling, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2018-09-03  9:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Fainelli, Simon Horman, Andy Gospodarek, mchan,
	Jiri Pirko, Alexander Duyck, Frederick Botha, nick viljoen,
	Linux Netdev List

On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> Hi!


Hi Jakub, and sorry for the late reply; this crazily hot summer refuses to die.

Note I replied a couple of minutes ago but it didn't get to the list, so
let's take it from this one:

> I wonder if we can use phys_port_id in switchdev to group together
> interfaces of a single PCI PF?  Here is the problem:
>
> With a mix of PF and VF interfaces it gets increasingly difficult to
> figure out which one corresponds to which PF.  We can identify which
> *representor* is which, by means of phys_port_name and devlink
> flavours.  But if the actual VF/PF interfaces are also present on the
> same host, it gets confusing when one tries to identify the PF they
> came from.  Generally one has to resort of matching between PCI DBDF of
> the PF and VFs or read relevant info out of ethtool -i.
>
> In multi host scenario this is particularly painful, as there seems to
> be no immediately obvious way to match PCI interface ID of a card (0,
> 1, 2, 3, 4...) to the DBDF we have connected.
>
> Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
> from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in somehow
> random manner, which means we have to provide those for all devices with
> link to the PF (all reprs).  And we have to link them (a) because it's
> right (tm) and (b) to get correct naming.


Wait, as you commented later, not only the Mellanox VF reprs but also
the nfp VF reprs are not linked to the PF, because ip link output
grows quadratically.

> The only reliable way to make
> user space (libvirt) choose the repr it should run the NDOs on (which is
> IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
> VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
> not the external/Ethernet port when in switchdev mode.  User space
> should understand phys_port_id in this context, given it was originally
> introduced for matching VFs to ports.


Using phys_port_id to match/group VFs to PFs makes sense to me.

So what would be the libvirt use case you envision that needs
the VF and PF reprs to support that as well? Or maybe you were
not referring to libvirt but to some other provisioning element? I need
to refresh my memory on that area.


* Re: phys_port_id in switchdev mode?
  2018-08-31 20:13   ` Marcelo Ricardo Leitner
  2018-09-01 11:34     ` Jakub Kicinski
@ 2018-09-03  9:43     ` Or Gerlitz
  1 sibling, 0 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-09-03  9:43 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Jakub Kicinski, Florian Fainelli, Simon Horman, Andy Gospodarek,
	mchan, Jiri Pirko, Alexander Duyck, Frederick Botha,
	nick viljoen, netdev

On Fri, Aug 31, 2018 at 11:13 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Tue, Aug 28, 2018 at 08:43:51PM +0200, Jakub Kicinski wrote:
>> Ugh, CC: netdev..
>>
>> On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:
>> > Hi!
>> >
>> > I wonder if we can use phys_port_id in switchdev to group together
>> > interfaces of a single PCI PF?  Here is the problem:
>
> On Mellanox cards, this is already possible via phys_switch_id, as
> each PF has its own phys_switch_id. So all VFs with a given
> phys_switch_id belong to the PF with that same phys_switch_id.

This is due to the fact that currently, when getting into switchdev mode,
the PF netdev becomes the uplink representor. This is problematic and we
are working to have an uplink repr as nfp and others have. Bottom line,
this is not the correct way to group a PF with its VFs; the switch ID is
something that relates to switch port reprs, not the entities behind them.


* Re: phys_port_id in switchdev mode?
  2018-09-01 11:34     ` Jakub Kicinski
@ 2018-09-03 13:55       ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 11+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-09-03 13:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Fainelli, Or Gerlitz, Simon Horman, Andy Gospodarek,
	mchan, Jiri Pirko, Alexander Duyck, Frederick Botha,
	nick viljoen, netdev

On Sat, Sep 01, 2018 at 01:34:12PM +0200, Jakub Kicinski wrote:
> On Fri, 31 Aug 2018 17:13:22 -0300, Marcelo Ricardo Leitner wrote:
> > On Tue, Aug 28, 2018 at 08:43:51PM +0200, Jakub Kicinski wrote:
> > > Ugh, CC: netdev..
> > > 
> > > On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:  
> > > > Hi!
> > > > 
> > > > I wonder if we can use phys_port_id in switchdev to group together
> > > > interfaces of a single PCI PF?  Here is the problem:  
> > 
> > On Mellanox cards, this is already possible via phys_switch_id, as
> > each PF has its own phys_switch_id. So all VFs with a given
> > phys_switch_id belong to the PF with that same phys_switch_id.
> 
> You mean Connect-X4 and on, Connect-X3 also shares PF between ports.

Yes, ConnectX-3 shares a PF between ports but doesn't support switchdev
mode.

I see the issue now. I was still considering the external ports as
uplink representors.


* Re: phys_port_id in switchdev mode?
  2018-09-03  9:40 ` Or Gerlitz
@ 2018-09-04 10:20   ` Jakub Kicinski
  2018-09-04 20:37     ` Or Gerlitz
  2018-09-05 16:20     ` Samudrala, Sridhar
  0 siblings, 2 replies; 11+ messages in thread
From: Jakub Kicinski @ 2018-09-04 10:20 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Florian Fainelli, Simon Horman, Andy Gospodarek, mchan,
	Jiri Pirko, Alexander Duyck, Frederick Botha, nick viljoen,
	Linux Netdev List

On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
> > Hi!  
> 
> Hi Jakub and sorry for the late reply, this crazigly hot summer refuses to die,
> 
> Note I replied couple of minutes ago but it didn't get to the list, so
> lets take it from this one:
>
> > I wonder if we can use phys_port_id in switchdev to group together
> > interfaces of a single PCI PF?  Here is the problem:
> >
> > With a mix of PF and VF interfaces it gets increasingly difficult to
> > figure out which one corresponds to which PF.  We can identify which
> > *representor* is which, by means of phys_port_name and devlink
> > flavours.  But if the actual VF/PF interfaces are also present on the
> > same host, it gets confusing when one tries to identify the PF they
> > came from.  Generally one has to resort of matching between PCI DBDF of
> > the PF and VFs or read relevant info out of ethtool -i.
> >
> > In multi host scenario this is particularly painful, as there seems to
> > be no immediately obvious way to match PCI interface ID of a card (0,
> > 1, 2, 3, 4...) to the DBDF we have connected.
> >
> > Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
> > from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in somehow
> > random manner, which means we have to provide those for all devices with
> > link to the PF (all reprs).  And we have to link them (a) because it's
> > right (tm) and (b) to get correct naming.  
> 
> wait, as you commented in later, not only the mellanox vf reprs but rather also
> the nfp vf reprs are not linked to the PF, because ip link output
> grows quadratically.

Right, correct.  If we set phys_port_id, libvirt will reliably pick the
correct netdev to run NDOs on (PF/PF repr), so we can remove them from
the other netdevs and therefore limit the size of ip link show output.

> > The only reliable way to make
> > user space (libvirt) choose the repr it should run the NDOs on (which is
> > IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
> > VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
> > not the external/Ethernet port when in switchdev mode.  User space
> > should understand phys_port_id in this context, given it was originally
> > introduced for matching VFs to ports.  
> 
> Using phy_port_id to match/group VFs to PFs makes sense to me.
> 
> So what would be the libvirt use case you envision that needs
> the VF and PF reprs to support that as well? or maybe you were
> not referring to libvirt but to some other provisioning element? I need
> to refresh my memory on that area.

Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
phys_port_id on the actual VF and then *a* netdev linked to physfn in
sysfs which will have the legacy NDOs.

We can't set the phys_port_id on the VF reprs because then we're back
to the problem of ip link output growing.  Perhaps we shouldn't set it
on PF repr either?

Let's make a table (assuming bare metal cloud scenario where Host0 is
controlling the network, while Host1 is the actual server):

[act - actual; rpr - representor; SN - serial number]

Today:

  dev     | host | sysfs | phys_-  | switch- | phys_-    | NDOs
          |      | link  | port_id | dev_id  | port_name | 
---------------------------------------------------------------
uplink    |   0  |   PF0 |   -     | ASIC SN | p0        | PF0
act PF0   |   0  |   PF0 |   -     |   -     |  -        |  -
act VF0/0 |   0  | VF0/0 |   -     |   -     |  -        |  -
rpr PF0   |   0  |    -  |   -     | ASIC SN | pf0       |  -
rpr VF0/0 |   0  |    -  |   -     | ASIC SN | pf0vf0    |  -
act PF1   |   1  |   PF1 |   -     |   -     |  -        | PF1
act VF1/0 |   1  | VF1/0 |   -     |   -     |  -        |  -
rpr PF1   |   0  |    -  |   -     | ASIC SN | pf1       |  -
rpr VF1/0 |   0  |    -  |   -     | ASIC SN | pf1vf0    |  -

Proposed:

  dev     | host | sysfs | phys_-  | switch- | phys_-    | NDOs
          |      | link  | port_id | dev_id  | port_name |
---------------------------------------------------------------
uplink    |   0  |   PF0 |   -     | ASIC SN | p0        |  -
act PF0   |   0  |   PF0 | PF0 SN  |   -     |  -        | PF0
act VF0/0 |   0  | VF0/0 | PF0 SN  |   -     |  -        |  -
rpr PF0   |   0  |   PF0 |   -     | ASIC SN | pf0       |  -
rpr VF0/0 |   0  |   PF0 |   -     | ASIC SN | pf0vf0    |  -
act PF1   |   1  |   PF1 | PF1 SN  |   -     |  -        | PF1
act VF1/0 |   1  | VF1/0 | PF1 SN  |   -     |  -        |  -
rpr PF1   |   0  |   PF0 |   -     | ASIC SN | pf1       |  -
rpr VF1/0 |   0  |   PF0 |   -     | ASIC SN | pf1vf0    |  -

With this, libvirt on Host0 should easily find the actual PF0 netdev to
run the NDO on, if it wants to use VFs:
 - libvirt finds act VF0/0 to plug into the VM;
 - reads its phys_port_id -> "PF0 SN";
 - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
 - runs NDOs on "act PF0" for PF0's VF correctly.
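
The four steps above can be sketched in Python against an in-memory copy of
the "Proposed" table. This is a model of the proposal, not libvirt's code.
The Netdev class, the PHYSFN mapping (inferred from the VF naming) and the
function ndo_netdev_for are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Netdev:
    name: str
    host: int
    sysfs_link: str              # PCI function the netdev is linked to
    phys_port_id: Optional[str]
    has_ndos: bool


# Rows of the "Proposed" table (switch ID and port name columns omitted).
PROPOSED = [
    Netdev("uplink", 0, "PF0", None, False),
    Netdev("act PF0", 0, "PF0", "PF0 SN", True),
    Netdev("act VF0/0", 0, "VF0/0", "PF0 SN", False),
    Netdev("rpr PF0", 0, "PF0", None, False),
    Netdev("rpr VF0/0", 0, "PF0", None, False),
    Netdev("act PF1", 1, "PF1", "PF1 SN", True),
    Netdev("act VF1/0", 1, "VF1/0", "PF1 SN", False),
    Netdev("rpr PF1", 0, "PF0", None, False),
    Netdev("rpr VF1/0", 0, "PF0", None, False),
]

# Assumed physfn relation: which PF each VF's PCI device points at.
PHYSFN = {"VF0/0": "PF0", "VF1/0": "PF1"}


def ndo_netdev_for(vf_name, netdevs=PROPOSED):
    """Return names of netdevs on the VF's host that are linked to its
    physfn and carry the same (non-empty) phys_port_id as the VF."""
    vf = next(n for n in netdevs if n.name == vf_name)
    pf_link = PHYSFN[vf.sysfs_link]
    return [
        n.name
        for n in netdevs
        if n.host == vf.host
        and n.sysfs_link == pf_link
        and n.phys_port_id is not None
        and n.phys_port_id == vf.phys_port_id
    ]
```

In this model, "act VF0/0" resolves to exactly one candidate, "act PF0",
even though six netdevs on Host0 are linked to PF0.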


The other problem remains unsolved - Host0 can't be sure without
vendor-specific knowledge whether it's connected to PF0 or PF1.
That's why I was thinking maybe we should provide phys_port_id
on PF representors as well.  That means we'd have to provide the 
legacy NDOs on PF reprs too because libvirt may now find PF repr...
Would it be cleaner to add a new attribute?  

Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?


* Re: phys_port_id in switchdev mode?
  2018-09-04 10:20   ` Jakub Kicinski
@ 2018-09-04 20:37     ` Or Gerlitz
  2018-09-05 16:43       ` Jakub Kicinski
  2018-09-05 16:20     ` Samudrala, Sridhar
  1 sibling, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2018-09-04 20:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Fainelli, Simon Horman, Andy Gospodarek, mchan,
	Jiri Pirko, Alexander Duyck, Frederick Botha, nick viljoen,
	Linux Netdev List

On Tue, Sep 4, 2018 at 1:20 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
>> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
>> > Hi!
>>
>> Hi Jakub and sorry for the late reply, this crazigly hot summer refuses to die,
>>
>> Note I replied couple of minutes ago but it didn't get to the list, so
>> lets take it from this one:
>>
>> > I wonder if we can use phys_port_id in switchdev to group together
>> > interfaces of a single PCI PF?  Here is the problem:
>> >
>> > With a mix of PF and VF interfaces it gets increasingly difficult to
>> > figure out which one corresponds to which PF.  We can identify which
>> > *representor* is which, by means of phys_port_name and devlink
>> > flavours.  But if the actual VF/PF interfaces are also present on the
>> > same host, it gets confusing when one tries to identify the PF they
>> > came from.  Generally one has to resort of matching between PCI DBDF of
>> > the PF and VFs or read relevant info out of ethtool -i.
>> >
>> > In multi host scenario this is particularly painful, as there seems to
>> > be no immediately obvious way to match PCI interface ID of a card (0,
>> > 1, 2, 3, 4...) to the DBDF we have connected.
>> >
>> > Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
>> > from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in somehow
>> > random manner, which means we have to provide those for all devices with
>> > link to the PF (all reprs).  And we have to link them (a) because it's
>> > right (tm) and (b) to get correct naming.
>>
>> wait, as you commented in later, not only the mellanox vf reprs but rather also
>> the nfp vf reprs are not linked to the PF, because ip link output
>> grows quadratically.
>
> Right, correct.  If we set phys_port_id libvirt will reliably pick the
> correct netdev to run NDOs on (PF/PF repr) so we can remove them from
> the other netdevs and therefore limit the size of ip link show output.

Just to make sure: is this a suggested/future flow, not the existing flow of libvirt?


> Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> phys_port_id on the actual VF and then *a* netdev linked to physfn in
> sysfs which will have the legacy NDOs.
>
> We can't set the phys_port_id on the VF reprs because then we're back
> to the problem of ip link output growing.  Perhaps we shouldn't set it
> on PF repr either?
>
> Let's make a table (assuming bare metal cloud scenario where Host0 is
> controlling the network, while Host1 is the actual server):

Yeah, this would be a superset of the non-SmartNIC case where
we have only one host.



[...]


> With this libvirt on Host0 should easily find the actual PF0 netdev to
> run the NDO on, if it wants to use VFs:
>  - libvrit finds act VF0/0 to plug into the VM;
>  - reads its phys_port_id -> "PF0 SN";
>  - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
>  - runs NDOs on "act PF0" for PF0's VF correctly.

What you describe here doesn't seem to be networking
configuration, as it deals only with VFs and PF but not with reprs,
and hence AFAIK runs on Host1.

[...]

> Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?

I need to think on that


* Re: phys_port_id in switchdev mode?
  2018-09-04 10:20   ` Jakub Kicinski
  2018-09-04 20:37     ` Or Gerlitz
@ 2018-09-05 16:20     ` Samudrala, Sridhar
  2018-09-05 16:47       ` Jakub Kicinski
  1 sibling, 1 reply; 11+ messages in thread
From: Samudrala, Sridhar @ 2018-09-05 16:20 UTC (permalink / raw)
  To: Jakub Kicinski, Or Gerlitz
  Cc: Florian Fainelli, Simon Horman, Andy Gospodarek, mchan,
	Jiri Pirko, Alexander Duyck, Frederick Botha, nick viljoen,
	Linux Netdev List

On 9/4/2018 3:20 AM, Jakub Kicinski wrote:
> On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
>> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
>>> Hi!
>> Hi Jakub and sorry for the late reply, this crazigly hot summer refuses to die,
>>
>> Note I replied couple of minutes ago but it didn't get to the list, so
>> lets take it from this one:
>>
>>> I wonder if we can use phys_port_id in switchdev to group together
>>> interfaces of a single PCI PF?  Here is the problem:
>>>
>>> With a mix of PF and VF interfaces it gets increasingly difficult to
>>> figure out which one corresponds to which PF.  We can identify which
>>> *representor* is which, by means of phys_port_name and devlink
>>> flavours.  But if the actual VF/PF interfaces are also present on the
>>> same host, it gets confusing when one tries to identify the PF they
>>> came from.  Generally one has to resort of matching between PCI DBDF of
>>> the PF and VFs or read relevant info out of ethtool -i.
>>>
>>> In multi host scenario this is particularly painful, as there seems to
>>> be no immediately obvious way to match PCI interface ID of a card (0,
>>> 1, 2, 3, 4...) to the DBDF we have connected.
>>>
>>> Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
>>> from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in somehow
>>> random manner, which means we have to provide those for all devices with
>>> link to the PF (all reprs).  And we have to link them (a) because it's
>>> right (tm) and (b) to get correct naming.
>> wait, as you commented in later, not only the mellanox vf reprs but rather also
>> the nfp vf reprs are not linked to the PF, because ip link output
>> grows quadratically.
> Right, correct.  If we set phys_port_id libvirt will reliably pick the
> correct netdev to run NDOs on (PF/PF repr) so we can remove them from
> the other netdevs and therefore limit the size of ip link show output.
>
>>> The only reliable way to make
>>> user space (libvirt) choose the repr it should run the NDOs on (which is
>>> IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
>>> VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
>>> not the external/Ethernet port when in switchdev mode.  User space
>>> should understand phys_port_id in this context, given it was originally
>>> introduced for matching VFs to ports.
>> Using phy_port_id to match/group VFs to PFs makes sense to me.
>>
>> So what would be the libvirt use case you envision that needs
>> the VF and PF reprs to support that as well? or maybe you were
>> not referring to libvirt but to some other provisioning element? I need
>> to refresh my memory on that area.
> Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> phys_port_id on the actual VF and then *a* netdev linked to physfn in
> sysfs which will have the legacy NDOs.
>
> We can't set the phys_port_id on the VF reprs because then we're back
> to the problem of ip link output growing.  Perhaps we shouldn't set it
> on PF repr either?
>
> Let's make a table (assuming bare metal cloud scenario where Host0 is
> controlling the network, while Host1 is the actual server):
>
> [act - actual; rpr - representor; SN -serial number]
>
> Today:
>
>    dev     | host | sysfs | phys_-  | switch- | phys_-    | NDOs
>            |      | link  | port_id | dev_id  | port_name |
> ---------------------------------------------------------------
> uplink    |   0  |   PF0 |   -     | ASIC SN | p0        | PF0
> act PF0   |   0  |   PF0 |   -     |   -     |  -        |  -
> act VF0/0 |   0  | VF0/0 |   -     |   -     |  -        |  -
> rpr PF0   |   0  |    -  |   -     | ASIC SN | pf0       |  -
> rpr VF0/0 |   0  |    -  |   -     | ASIC SN | pf0vf0    |  -
> act PF1   |   1  |   PF1 |   -     |   -     |  -        | PF1
> act VF1/0 |   1  | VF1/0 |   -     |   -     |  -        |  -
> rpr PF1   |   0  |    -  |   -     | ASIC SN | pf1       |  -
> rpr VF1/0 |   0  |    -  |   -     | ASIC SN | pf1vf0    |  -
>
> Proposed:
>
>    dev     | host | sysfs | phys_-  | switch- | phys_-    | NDOs
>            |      | link  | port_id | dev_id  | port_name |
> ---------------------------------------------------------------
> uplink    |   0  |   PF0 |   -     | ASIC SN | p0        |  -
> act PF0   |   0  |   PF0 | PF0 SN  |   -     |  -        | PF0
> act VF0/0 |   0  | VF0/0 | PF0 SN  |   -     |  -        |  -
> rpr PF0   |   0  |   PF0 |   -     | ASIC SN | pf0       |  -
> rpr VF0/0 |   0  |   PF0 |   -     | ASIC SN | pf0vf0    |  -
> act PF1   |   1  |   PF1 | PF1 SN  |   -     |  -        | PF1
> act VF1/0 |   1  | VF1/0 | PF1 SN  |   -     |  -        |  -
> rpr PF1   |   0  |   PF0 |   -     | ASIC SN | pf1       |  -
> rpr VF1/0 |   0  |   PF0 |   -     | ASIC SN | pf1vf0    |  -
>
> With this libvirt on Host0 should easily find the actual PF0 netdev to
> run the NDO on, if it wants to use VFs:
>   - libvrit finds act VF0/0 to plug into the VF;
>   - reads its phys_port_id -> "PF0 SN";
>   - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
>   - runs NDOs on "act PF0" for PF0's VF correctly.

I think Host0 corresponds to the embedded OS on the NIC. Is this correct?
I guess in this setup, only PF0's PCI interface on Host0 is in switchdev mode and
the representors for PF0 and its VFs are created on Host0 when they come up
on Host1. I would think PF0 on Host0 acts as a Control PF for PF1 on Host1.

Isn't the hypervisor running only on Host1?


>
>
> The other problem remains unsolved - Host0 can't be sure without
> vendor-specific knowledge whether it's connected to PF0 or PF1.
> That's why I was thinking maybe we should provide phys_port_id
> on PF representors as well.  That means we'd have to provide the
> legacy NDOs on PF reprs too because libvirt may now find PF repr...
> Would it be cleaner to add a new attribute?
>
> Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?

Do you mean the legacy VF ndo ops on the PF?  I think it is possible to configure
the VFs on Host1 via the port representors except for the MAC address.


* Re: phys_port_id in switchdev mode?
  2018-09-04 20:37     ` Or Gerlitz
@ 2018-09-05 16:43       ` Jakub Kicinski
  0 siblings, 0 replies; 11+ messages in thread
From: Jakub Kicinski @ 2018-09-05 16:43 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Florian Fainelli, Simon Horman, Andy Gospodarek, mchan,
	Jiri Pirko, Alexander Duyck, Frederick Botha, nick viljoen,
	Linux Netdev List

On Tue, 4 Sep 2018 23:37:29 +0300, Or Gerlitz wrote:
> On Tue, Sep 4, 2018 at 1:20 PM, Jakub Kicinski wrote:
> > On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:  
> >> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:  
> >> > Hi!  
> >>
> >> Hi Jakub and sorry for the late reply, this crazigly hot summer refuses to die,
> >>
> >> Note I replied couple of minutes ago but it didn't get to the list, so
> >> lets take it from this one:
> >>  
> >> > I wonder if we can use phys_port_id in switchdev to group together
> >> > interfaces of a single PCI PF?  Here is the problem:
> >> >
> >> > With a mix of PF and VF interfaces it gets increasingly difficult to
> >> > figure out which one corresponds to which PF.  We can identify which
> >> > *representor* is which, by means of phys_port_name and devlink
> >> > flavours.  But if the actual VF/PF interfaces are also present on the
> >> > same host, it gets confusing when one tries to identify the PF they
> >> > came from.  Generally one has to resort of matching between PCI DBDF of
> >> > the PF and VFs or read relevant info out of ethtool -i.
> >> >
> >> > In multi host scenario this is particularly painful, as there seems to
> >> > be no immediately obvious way to match PCI interface ID of a card (0,
> >> > 1, 2, 3, 4...) to the DBDF we have connected.
> >> >
> >> > Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
> >> > from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in somehow
> >> > random manner, which means we have to provide those for all devices with
> >> > link to the PF (all reprs).  And we have to link them (a) because it's
> >> > right (tm) and (b) to get correct naming.  
> >>
> >> wait, as you commented in later, not only the mellanox vf reprs but rather also
> >> the nfp vf reprs are not linked to the PF, because ip link output
> >> grows quadratically.  
> >
> > Right, correct.  If we set phys_port_id libvirt will reliably pick the
> > correct netdev to run NDOs on (PF/PF repr) so we can remove them from
> > the other netdevs and therefore limit the size of ip link show output.  
> 
> just to make sure, this is suggested/future not existing flow of libvirt?

Mm..  admittedly I haven't investigated in depth, but my colleague did
and indicated this is the current flow.  It matches phys_port_id right
here:

https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L2793

Are we wrong?

> > Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> > phys_port_id on the actual VF and then *a* netdev linked to physfn in
> > sysfs which will have the legacy NDOs.
> >
> > We can't set the phys_port_id on the VF reprs because then we're back
> > to the problem of ip link output growing.  Perhaps we shouldn't set it
> > on PF repr either?
> >
> > Let's make a table (assuming bare metal cloud scenario where Host0 is
> > controlling the network, while Host1 is the actual server):  
> 
> yeah, this would be a super-set the non-smartnic case where
> we have only one host.
> 
> [...]
> 
> > With this libvirt on Host0 should easily find the actual PF0 netdev to
> > run the NDO on, if it wants to use VFs:
> >  - libvrit finds act VF0/0 to plug into the VM;
> >  - reads its phys_port_id -> "PF0 SN";
> >  - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
> >  - runs NDOs on "act PF0" for PF0's VF correctly.  
> 
> What you describe here doesn't seem to be networking
> configuration, as it deals only with VFs and PF but not with reprs,
> and hence AFAIK runs on host host1

No, hm, depends on your definition of SmartNIC.  An ARM64 control CPU is
capable of running VMs.  Why would you not run VMs on your controller?
Or one day we will need reprs for containers, people are definitely
going to run containers on the controller...  I wouldn't design this
assuming there is no advanced switching à la service chains on the
control CPU...

> > Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?  
> 
> I need to think on that

Okay :)


* Re: phys_port_id in switchdev mode?
  2018-09-05 16:20     ` Samudrala, Sridhar
@ 2018-09-05 16:47       ` Jakub Kicinski
  0 siblings, 0 replies; 11+ messages in thread
From: Jakub Kicinski @ 2018-09-05 16:47 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Or Gerlitz, Florian Fainelli, Simon Horman, Andy Gospodarek,
	mchan, Jiri Pirko, Alexander Duyck, Frederick Botha,
	nick viljoen, Linux Netdev List

On Wed, 5 Sep 2018 09:20:43 -0700, Samudrala, Sridhar wrote:
> > With this libvirt on Host0 should easily find the actual PF0 netdev to
> > run the NDO on, if it wants to use VFs:
> >   - libvrit finds act VF0/0 to plug into the VF;
> >   - reads its phys_port_id -> "PF0 SN";
> >   - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
> >   - runs NDOs on "act PF0" for PF0's VF correctly.  
> 
> I think Host0 corresponds to embedded OS on the NIC. Is this correct?
> I guess in this setup, only PF0's PCI interface on Host0 is in switchdev mode and
> the representors for PF0 and its VFs are created on Host0 when they come up
> on Host1. I would think PF0 on Host0 acts as a Control PF for PF1 on Host1.
> 
> Isn't hypervisor running only on Host1?

The main hypervisor is, but Host0 can very easily want to run some DPI
or some such on flows before it allows them through, and people like
running DPI-like apps in VMs/containers..

> > The other problem remains unsolved - Host0 can't be sure without
> > vendor-specific knowledge whether it's connected to PF0 or PF1.
> > That's why I was thinking maybe we should provide phys_port_id
> > on PF representors as well.  That means we'd have to provide the
> > legacy NDOs on PF reprs too because libvirt may now find PF repr...
> > Would it be cleaner to add a new attribute?
> >
> > Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?  
> 
> Do you mean the legacy VF ndo ops on the PF?  I think it is possible to configure
> the VFs on Host1 via the port representors except for the MAC address.

Yes, the MAC address would be the only one.  Could Host0 care about
which MACs Host1 assigned to its VFs?  I'm not sure..


