From: Jakub Kicinski
Subject: Re: phys_port_id in switchdev mode?
Date: Wed, 5 Sep 2018 18:43:50 +0200
Message-ID: <20180905184350.5c837ba2@cakuba>
References: <20180828200539.1c0fe607@cakuba.netronome.com>
 <20180904122057.46fce83a@cakuba>
To: Or Gerlitz
Cc: Florian Fainelli, Simon Horman, Andy Gospodarek, "mchan@broadcom.com",
 Jiri Pirko, Alexander Duyck, Frederick Botha, nick viljoen,
 Linux Netdev List

On Tue, 4 Sep 2018 23:37:29 +0300, Or Gerlitz wrote:
> On Tue, Sep 4, 2018 at 1:20 PM, Jakub Kicinski wrote:
> > On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
> >> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
> >> > Hi!
> >>
> >> Hi Jakub, and sorry for the late reply, this crazily hot summer refuses to die.
> >>
> >> Note I replied a couple of minutes ago but it didn't get to the list, so
> >> let's take it from this one:
> >>
> >> > I wonder if we can use phys_port_id in switchdev mode to group together
> >> > interfaces of a single PCI PF?  Here is the problem:
> >> >
> >> > With a mix of PF and VF interfaces it gets increasingly difficult to
> >> > figure out which one corresponds to which PF.  We can identify which
> >> > *representor* is which, by means of phys_port_name and devlink
> >> > flavours.  But if the actual VF/PF interfaces are also present on the
> >> > same host, it gets confusing when one tries to identify the PF they
> >> > came from.  Generally one has to resort to matching between the PCI DBDF
> >> > of the PF and the VFs, or read the relevant info out of ethtool -i.
> >> >
> >> > In a multi-host scenario this is particularly painful, as there seems to
> >> > be no immediately obvious way to match the PCI interface ID of a card (0,
> >> > 1, 2, 3, 4...) to the DBDF we have connected.
> >> >
> >> > Another angle to this is the legacy SR-IOV NDOs.  User space picks a netdev
> >> > from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in a somewhat
> >> > random manner, which means we have to provide those for all devices with
> >> > a link to the PF (all reprs).  And we have to link them (a) because it's
> >> > right (tm) and (b) to get correct naming.
> >>
> >> wait, as you commented later, not only the mellanox vf reprs but also
> >> the nfp vf reprs are not linked to the PF, because the ip link output
> >> grows quadratically.
> >
> > Right, correct.  If we set phys_port_id libvirt will reliably pick the
> > correct netdev to run NDOs on (PF/PF repr), so we can remove them from
> > the other netdevs and therefore limit the size of ip link show output.
>
> just to make sure, this is a suggested/future flow of libvirt, not an
> existing one?

Mm.. admittedly I haven't investigated in depth, but my colleague did
and indicated this is the current flow.  It matches phys_port_id right
here:

https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L2793

Are we wrong?

> > Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> > phys_port_id on the actual VF and then *a* netdev linked to physfn in
> > sysfs which will have the legacy NDOs.
> >
> > We can't set the phys_port_id on the VF reprs because then we're back
> > to the problem of ip link output growing.  Perhaps we shouldn't set it
> > on the PF repr either?
> >
> > Let's make a table (assuming a bare metal cloud scenario where Host0 is
> > controlling the network, while Host1 is the actual server):
>
> yeah, this would be a super-set of the non-smartnic case where
> we have only one host.
>
> [...]
>
> > With this libvirt on Host0 should easily find the actual PF0 netdev to
> > run the NDO on, if it wants to use VFs:
> > - libvirt finds act VF0/0 to plug into the VM;
> > - reads its phys_port_id -> "PF0 SN";
> > - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
> > - runs NDOs on "act PF0" for PF0's VF correctly.
>
> What you describe here doesn't seem to be networking
> configuration, as it deals only with the VFs and the PF but not with reprs,
> and hence AFAIK runs on Host1

No, hm, depends on your definition of SmartNIC.  The ARM64 control CPU
is capable of running VMs.  Why would you not run VMs on your
controller?  Or one day we will need reprs for containers; people are
definitely going to run containers on the controller...  I wouldn't
design this assuming there is no advanced switching a la service
chains on the control CPU...

> > Should Host0 in a bare metal cloud have access to the SR-IOV NDOs of Host1?
>
> I need to think on that

Okay :)
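
FWIW, here is a rough sketch of the lookup from the list above, as user
space could do it.  This is purely hypothetical (not libvirt's actual
code; the function names are made up and error handling is trimmed):
read the VF netdev's phys_port_id, walk physfn/net/, pick the netdev
reporting the same ID, and run the legacy NDOs on that one.

/*
 * Hypothetical sketch: given a VF's full PCI DBDF (e.g. "0000:04:08.2")
 * and its netdev name, find the netdev registered under its physfn
 * whose phys_port_id matches the VF's, i.e. the one netdev which would
 * carry the legacy SR-IOV NDOs.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Read /sys/class/net/<ifname>/phys_port_id; empty string if not exposed */
static void read_phys_port_id(const char *ifname, char *buf, size_t len)
{
	char path[256];
	FILE *f;

	buf[0] = '\0';
	snprintf(path, sizeof(path), "/sys/class/net/%s/phys_port_id", ifname);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(buf, len, f))
		buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
}

static int find_pf_netdev(const char *vf_dbdf, const char *vf_ifname,
			  char *pf_ifname, size_t len)
{
	char vf_id[64], cand_id[64], path[256];
	struct dirent *de;
	DIR *dir;

	read_phys_port_id(vf_ifname, vf_id, sizeof(vf_id));
	if (!vf_id[0])
		return -1;	/* VF netdev doesn't report phys_port_id */

	/* netdevs of the parent PF live under <VF_DBDF>/physfn/net/ */
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/physfn/net",
		 vf_dbdf);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir))) {
		if (de->d_name[0] == '.')
			continue;
		read_phys_port_id(de->d_name, cand_id, sizeof(cand_id));
		if (!strcmp(vf_id, cand_id)) {
			snprintf(pf_ifname, len, "%s", de->d_name);
			closedir(dir);
			return 0;	/* run the legacy NDOs on this netdev */
		}
	}
	closedir(dir);
	return -1;
}

With that, only one netdev under physfn/net/ needs to expose the legacy
NDOs, as long as the actual VF and that one netdev report the same
phys_port_id.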