* Switchdev Application to SR-IOV NICs
@ 2015-03-04  0:26 David Christensen
  2015-03-04  4:05 ` John Fastabend
From: David Christensen @ 2015-03-04  0:26 UTC (permalink / raw)
  To: netdev; +Cc: Jiří Pírko (jiri@resnulli.us)

I'm struggling with the concept of implementing switchdev on an SR-IOV NIC.
Most slides presented at Netdev 0.1 agreed that switchdev should be applicable
to SR-IOV NICs as well as switch ASICs, but I'm having difficulty figuring
out exactly how things should operate.  Here's how things look today with
netdev and SR-IOV VFs passed-through to a virtual machine.

      +-----+-----+-----+
      | vm0 | vm1 | vm2 | Virtual
      | eth0| eth0| eth0| Machines
+-----+--|--+--|--+--|--+----------
|eth0 |  |     |     |    Kernel
+--|--+--|-----|-----|--+----------
| pf0   vf0   vf1   vf2 | PCIe
+--|-----|-----|-----|--+----------
| ++-----+-----+-----++ | SR-IOV NIC
| | VEB               | |
| +------------+------+ |
+--------------|--------+
               |
              PHY

Connectivity between VMs and the host is handled by the VEB operating in the
NIC; other traffic is forwarded normally by the VEB from the external network
to the host/VM based on destination MAC and VLAN, with special handling
required for broadcast/multicast.

Based on some separate conversations I've had with Jiri, I'm led to believe
switchdev would look something like this.

      +-----+-----+-----+
      | vm0 | vm1 | vm2 | Virtual
      | eth0| eth0| eth0| Machines
+-----+--|--+--|--+--|--+----------
|sw0p0 sw0p1 sw0p2 sw0p3| Kernel
+--|-----|-----|-----|--+----------
| pf0   vf0   vf1   vf2 | PCIe
+--|-----|-----|-----|--+----------
| ++-----+-----+-----++ | SR-IOV NIC
| | VEB               | |
| +------------+------+ |
| SR-IOV NIC   |        |
+--------------|--------+
               |
              PHY

The use of switchdev would show that all sw0* devices are associated with the
same switch, and the instantiation of the sw0* devices in the kernel would
allow higher-level applications like OVS, the Linux bridge, etc. to control
traffic in a way not possible in the earlier example.  So far so good?
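[Editor's illustration, not part of the original mail: a toy sketch of that
association. On real systems the kernel exposes a per-switch ID (e.g. via
/sys/class/net/<dev>/phys_switch_id) and userspace groups ports by it; the
device names and IDs below are invented.]

```python
# Toy sketch (names/IDs invented): group netdevs by switch ID, the way
# userspace decides which ports belong to one switchdev instance.
from collections import defaultdict

phys_switch_id = {
    "sw0p0": "00154d",
    "sw0p1": "00154d",
    "sw0p2": "00154d",
    "sw0p3": "00154d",
    "eth1":  None,      # ordinary NIC: no switch ID attribute
}

switches = defaultdict(list)
for dev in sorted(phys_switch_id):
    swid = phys_switch_id[dev]
    if swid is not None:
        switches[swid].append(dev)
```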

Now the question becomes how to plumb an SR-IOV NIC to create this representation.
Looking at one specific path:

  +-----+
  | vm0 |
  | eth0|
  +--|--+
  |sw0p1|
  +--|--+
  | vf0 |
+----|----+
| +--+--+ |
| | VEB | |
| +-----+ |
+---------+

It's unclear to me when traffic egressing the VEB should terminate at sw0p1 vs.
vm0's eth0.  They both represent the same MAC/VLAN.  Similarly, for traffic
egressing vm0's eth0, when should it terminate at sw0p1 vs. the VEB?

Can anyone offer an alternate diagram for switchdev on an SR-IOV NIC?

Dave


* Re: Switchdev Application to SR-IOV NICs
  2015-03-04  0:26 Switchdev Application to SR-IOV NICs David Christensen
@ 2015-03-04  4:05 ` John Fastabend
  2015-03-04  7:25   ` Jiri Pirko
  2015-03-04 21:51   ` David Christensen
From: John Fastabend @ 2015-03-04  4:05 UTC (permalink / raw)
  To: David Christensen
  Cc: netdev, "Jiří Pírko (jiri@resnulli.us)"

On 03/03/2015 04:26 PM, David Christensen wrote:
> I'm struggling with the concept of implementing switchdev on an SR-IOV NIC.
> Most slides presented at Netdev 0.1 agreed that switchdev should be applicable
> to SR-IOV NICs as well as switch ASICs, but I'm having difficulty figuring
> out exactly how things should operate.  Here's how things look today with
> netdev and SR-IOV VFs passed-through to a virtual machine.
>
>        +-----+-----+-----+
>        | vm0 | vm1 | vm2 | Virtual
>        | eth0| eth0| eth0| Machines
> +-----+--|--+--|--+--|--+----------
> |eth0 |  |     |     |    Kernel
> +--|--+--|-----|-----|--+----------
> | pf0   vf0   vf1   vf2 | PCIe
> +--|-----|-----|-----|--+----------
> | ++-----+-----+-----++ | SR-IOV NIC
> | | VEB               | |
> | +------------+------+ |
> +--------------|--------+
>                 |
>                PHY
>
> Connectivity between VMs and the host is handled by the VEB operating in the
> NIC; other traffic is forwarded normally by the VEB from the external network
> to the host/VM based on destination MAC and VLAN, with special handling
> required for broadcast/multicast.
>
> Based on some separate conversations I've had with Jiri, I'm led to believe
> switchdev would look something like this.
>
>        +-----+-----+-----+
>        | vm0 | vm1 | vm2 | Virtual
>        | eth0| eth0| eth0| Machines
> +-----+--|--+--|--+--|--+----------
> |sw0p0 sw0p1 sw0p2 sw0p3| Kernel
> +--|-----|-----|-----|--+----------
> | pf0   vf0   vf1   vf2 | PCIe
> +--|-----|-----|-----|--+----------
> | ++-----+-----+-----++ | SR-IOV NIC
> | | VEB               | |
> | +------------+------+ |
> | SR-IOV NIC   |        |
> +--------------|--------+
>                 |
>                PHY

That looks good to me. I might add one more netdev to represent the
egress port though. This could be used to submit control traffic
that should not, by spec, be sent through a VEB: for example STP,
LLDP, etc. At the moment we send this traffic on sw0p0, which is
not exactly correct.
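[Editor's note, not John's code: the STP/LLDP frames mentioned above are
identifiable by their reserved destination MACs — IEEE 802.1D reserves
01:80:C2:00:00:00..0F, which a conforming bridge/VEB must not forward. A
minimal sketch of that classification, nothing driver-specific:]

```python
# Link-local control frames (STP BPDUs, nearest-bridge LLDP, ...) use the
# reserved 802.1D range 01:80:C2:00:00:00 .. 01:80:C2:00:00:0F, which a
# spec-conforming VEB must not forward like ordinary traffic.
RESERVED_OUI = bytes.fromhex("0180c20000")

def is_link_local_control(dst_mac: bytes) -> bool:
    return dst_mac[:5] == RESERVED_OUI and dst_mac[5] <= 0x0F

STP_DMAC  = bytes.fromhex("0180c2000000")  # Spanning Tree BPDUs
LLDP_DMAC = bytes.fromhex("0180c200000e")  # LLDP, nearest bridge
VM_DMAC   = bytes.fromhex("525400123456")  # ordinary unicast (invented)
```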

I had some prototype code at one point that did this; I can dig it
up if folks think it's useful.

Also it might be worth noting the "Kernel" net_devices are not
actually bound to the virtual function but multiplexed/demux'd
over the physical function pf0 in the diagram. The diagram might
be read to imply some PCIe relationship between sw0p3 and vf2.

>
> The use of switchdev would show that all sw0* devices are associated with the
> same switch, and the instantiation of the sw0* devices in the kernel would
> allow higher-level applications like OVS, the Linux bridge, etc. to control
> traffic in a way not possible in the earlier example.  So far so good?
>
> Now the question becomes how to plumb SR-IOV NIC to create this representation.
> Looking at one specific path:
>
>    +-----+
>    | vm0 |
>    | eth0|
>    +--|--+
>    |sw0p1|
>    +--|--+
>    | vf0 |
> +----|----+
> | +--+--+ |
> | | VEB | |
> | +-----+ |
> +---------+
>
> It's unclear to me when traffic egressing the VEB should terminate at sw0p1 vs.
> vm0's eth0.  They both represent the same MAC/VLAN.  Similarly, for traffic
> egressing vm0's eth0, when should it terminate at sw0p1 vs. the VEB.
>
> Can anyone offer an alternate diagram for switchdev on an SR-IOV NIC?
>

One approach would be to treat it like the switch case, where instead
of a physical port you have a VF. In this case, if you xmit a packet on
sw0p1 it is sent to eth0. Then if vm0 (eth0) xmits a packet, it enters
the VEB. The only way to get packets onto sw0p1 is to use a rule to
either "trap" or "mirror" packets to the "CPU sw0p1 port". Maybe a
better name would be "hypervisor sw0p1 port". This would be analogous
to the switch case; I have experimented with adding this support to
the Flow API I'm working on but have not implemented it on rocker yet.
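[Editor's illustration, not John's Flow API: a toy model of the trap/mirror
idea. The rule format is invented; "trap" steals the frame for the
hypervisor-facing port, "mirror" copies it there in addition to normal
forwarding.]

```python
# Invented VEB rule table: dst MAC -> (action, egress ports).
TRAP, MIRROR, FORWARD = "trap", "mirror", "forward"

rules = {
    "01:80:c2:00:00:0e": (TRAP,    ["sw0p1"]),        # LLDP: host only
    "52:54:00:aa:bb:01": (MIRROR,  ["vf0", "sw0p1"]), # copy to host too
    "52:54:00:aa:bb:02": (FORWARD, ["vf1"]),          # plain switching
}

def veb_deliver(dst_mac):
    # Miss: forward toward the wire (flooding is omitted for brevity).
    action, ports = rules.get(dst_mac, (FORWARD, ["phy"]))
    return ports
```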


  +-----+      +-----+
  |hyper|      | vm1 |
  |visor|      | eth0|
  +-----+      +-----+
     |            |
  +--|--+      +--|--+
  |sw0p0|      |sw0p2|
  +-----+      +-----+
     |           |
  +--|-----|-----|-----|--+
  | ++-----+-----+-----++ |
  | | VEB               | |
  | +------------+------+ |
  | SR-IOV NIC   |        |
  +--------------|--------+
                  |
                 PHY

Here the link between sw0p2 and vm1 is a virtual function instead of a
physical wire, and sw0p0 is the "CPU port" directly to the hypervisor.

Is that at all clear? Let me know; I can try to do a better write-up
in the AM.

.John

> Dave
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
John Fastabend         Intel Corporation


* Re: Switchdev Application to SR-IOV NICs
  2015-03-04  4:05 ` John Fastabend
@ 2015-03-04  7:25   ` Jiri Pirko
  2015-03-04 21:51   ` David Christensen
From: Jiri Pirko @ 2015-03-04  7:25 UTC (permalink / raw)
  To: John Fastabend; +Cc: David Christensen, netdev

Wed, Mar 04, 2015 at 05:05:03AM CET, john.fastabend@gmail.com wrote:
>On 03/03/2015 04:26 PM, David Christensen wrote:
>>I'm struggling with the concept of implementing switchdev on an SR-IOV NIC.
>>Most slides presented at Netdev 0.1 agreed that switchdev should be applicable
>>to SR-IOV NICs as well as switch ASICs, but I'm having difficulty figuring
>>out exactly how things should operate.  Here's how things look today with
>>netdev and SR-IOV VFs passed-through to a virtual machine.
>>
>>       +-----+-----+-----+
>>       | vm0 | vm1 | vm2 | Virtual
>>       | eth0| eth0| eth0| Machines
>>+-----+--|--+--|--+--|--+----------
>>|eth0 |  |     |     |    Kernel
>>+--|--+--|-----|-----|--+----------
>>| pf0   vf0   vf1   vf2 | PCIe
>>+--|-----|-----|-----|--+----------
>>| ++-----+-----+-----++ | SR-IOV NIC
>>| | VEB               | |
>>| +------------+------+ |
>>+--------------|--------+
>>                |
>>               PHY
>>
>>Connectivity between VMs and the host is handled by the VEB operating in the
>>NIC; other traffic is forwarded normally by the VEB from the external network
>>to the host/VM based on destination MAC and VLAN, with special handling
>>required for broadcast/multicast.
>>
>>Based on some separate conversations I've had with Jiri, I'm led to believe
>>switchdev would look something like this.
>>
>>       +-----+-----+-----+
>>       | vm0 | vm1 | vm2 | Virtual
>>       | eth0| eth0| eth0| Machines
>>+-----+--|--+--|--+--|--+----------
>>|sw0p0 sw0p1 sw0p2 sw0p3| Kernel
>>+--|-----|-----|-----|--+----------
>>| pf0   vf0   vf1   vf2 | PCIe
>>+--|-----|-----|-----|--+----------
>>| ++-----+-----+-----++ | SR-IOV NIC
>>| | VEB               | |
>>| +------------+------+ |
>>| SR-IOV NIC   |        |
>>+--------------|--------+
>>                |
>>               PHY
>
>That looks good to me. I might add one more netdev to represent the
>egress port though. This could be used to submit control traffic
>that should not, by spec, be sent through a VEB: for example STP,
>LLDP, etc. At the moment we send this traffic on sw0p0, which is
>not exactly correct.

Indeed. Looks like we may need two more netdevices on top of that: one to
represent the actual PHY (sw0pX) and one to represent the associated port in
the embedded switch (ethX). I thought this could be handled by the PF netdev,
but that does not look correct now. For example, a packet going from vm1 to
the PF should go through sw0p0, finishing in the PF netdev. On the other hand,
a packet going from vm1 to the outside should go through sw0pX and ethX.

      +-----+-----+-----+
      | vm0 | vm1 | vm2 | Virtual
      | eth0| eth0| eth0| Machines
+-----+--|--+--|--+--|--+----------
| host|  |  |  |  |  |  |
| eth0|  |  |  |  |  |  | kernel
+--|--+--|--+--|--+--|--+
|sw0p0 sw0p1 sw0p2 sw0p3|
+--|-----|-----|-----|--+----------
| pf0   vf0   vf1   vf2 | PCIe
+--|-----|-----|-----|--+----------
| ++-----+-----+-----++ | SR-IOV NIC
| | VEB               | |
| +------------+------+ |
| SR-IOV NIC   |        |
+--------------|--------+
               |
             sw0pX
               |
              ethX
              PHY

From a higher perspective, this somehow downgrades the PF functionality
to that of one of the VFs.

The best would be to use the PF for the PHY port. In that case, the PF
netdev would be just a representor, with no possibility of actually using
it in the host for host-targeted traffic. But I'm not sure that is doable,
since it would break the current model we have in the kernel.
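[Editor's sketch of the representor semantics described above, with invented
names: transmitting on a port's representor injects the frame out of that
switch port, and frames the switch traps from that port surface on the
representor's receive side.]

```python
class Switch:
    """Toy embedded switch: records frames sent to the wire per port."""
    def __init__(self):
        self.wire = []            # (port, frame) pairs that left via PHY
        self.reps = {}            # port name -> Representor

    def inject(self, port, frame):   # representor TX path
        self.wire.append((port, frame))

    def punt(self, port, frame):     # trapped traffic -> representor RX
        self.reps[port].rx_queue.append(frame)

class Representor:
    """Host-side netdev standing in for one switch port."""
    def __init__(self, switch, port):
        self.switch, self.port = switch, port
        self.rx_queue = []
        switch.reps[port] = self

    def xmit(self, frame):
        self.switch.inject(self.port, frame)

sw = Switch()
rep = Representor(sw, "sw0pX")
rep.xmit("lldp-frame")            # host -> wire through port sw0pX
sw.punt("sw0pX", "stp-frame")     # trapped from the wire -> host
```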


>
>I had some prototype code at one point that did this; I can dig it
>up if folks think it's useful.
>
>Also it might be worth noting the "Kernel" net_devices are not
>actually bound to the virtual function but multiplexed/demux'd
>over the physical function pf0 in the diagram. The diagram might
>be read to imply some PCIe relationship between sw0p3 and vf2.

Exactly. This is just about how to represent things in the kernel.


>
>>
>>The use of switchdev would show that all sw0* devices are associated with the
>>same switch, and the instantiation of the sw0* devices in the kernel would
>>allow higher-level applications like OVS, the Linux bridge, etc. to control
>>traffic in a way not possible in the earlier example.  So far so good?
>>
>>Now the question becomes how to plumb SR-IOV NIC to create this representation.
>>Looking at one specific path:
>>
>>   +-----+
>>   | vm0 |
>>   | eth0|
>>   +--|--+
>>   |sw0p1|
>>   +--|--+
>>   | vf0 |
>>+----|----+
>>| +--+--+ |
>>| | VEB | |
>>| +-----+ |
>>+---------+
>>
>>It's unclear to me when traffic egressing the VEB should terminate at sw0p1 vs.
>>vm0's eth0.  They both represent the same MAC/VLAN.  Similarly, for traffic
>>egressing vm0's eth0, when should it terminate at sw0p1 vs. the VEB.
>>
>>Can anyone offer an alternate diagram for switchdev on an SR-IOV NIC?
>>
>
>One approach would be to treat it like the switch case, where instead
>of a physical port you have a VF. In this case, if you xmit a packet on
>sw0p1 it is sent to eth0. Then if vm0 (eth0) xmits a packet, it enters
>the VEB. The only way to get packets onto sw0p1 is to use a rule to
>either "trap" or "mirror" packets to the "CPU sw0p1 port". Maybe a
>better name would be "hypervisor sw0p1 port". This would be analogous
>to the switch case; I have experimented with adding this support to
>the Flow API I'm working on but have not implemented it on rocker yet.
>
>
> +-----+      +-----+
> |hyper|      | vm1 |
> |visor|      | eth0|
> +-----+      +-----+
>    |            |
> +--|--+      +--|--+
> |sw0p0|      |sw0p2|
> +-----+      +-----+
>    |           |
> +--|-----|-----|-----|--+
> | ++-----+-----+-----++ |
> | | VEB               | |
> | +------------+------+ |
> | SR-IOV NIC   |        |
> +--------------|--------+
>                 |
>                PHY
>
>Here the link between sw0p2 and vm1 is a virtual function instead of a
>physical wire, and sw0p0 is the "CPU port" directly to the hypervisor.
>
>Is that at all clear? Let me know; I can try to do a better write-up
>in the AM.
>
>.John
>
>>Dave
>
>
>-- 
>John Fastabend         Intel Corporation


* RE: Switchdev Application to SR-IOV NICs
  2015-03-04  4:05 ` John Fastabend
  2015-03-04  7:25   ` Jiri Pirko
@ 2015-03-04 21:51   ` David Christensen
From: David Christensen @ 2015-03-04 21:51 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, "Jiří Pírko (jiri@resnulli.us)"

> >       +-----+-----+-----+
> >       | vm0 | vm1 | vm2 | Virtual
> >       | eth0| eth0| eth0| Machines
> > +-----+--|--+--|--+--|--+----------
> > |sw0p0 sw0p1 sw0p2 sw0p3| Kernel
> > +--|-----|-----|-----|--+----------
> > | pf0   vf0   vf1   vf2 | PCIe
> > +--|-----|-----|-----|--+----------
> > | ++-----+-----+-----++ | SR-IOV NIC
> > | | VEB               | |
> > | +------------+------+ |
> > | SR-IOV NIC   |        |
> > +--------------|--------+
> >                 |
> >                PHY

I did wonder whether a KVM virtualization use case is the only one to
consider.  What about a container model?  This could be very
confusing to users, with all of the "phantom" sw0p* devices visible
alongside the eth* devices.

> 
> That looks good to me. I might add one more netdev to represent the
> egress port though. This could be used to submit control traffic
> that should not, by spec, be sent through a VEB: for example STP,
> LLDP, etc. At the moment we send this traffic on sw0p0, which is
> not exactly correct.
>

How do we separate hypervisor/eth0 from hypervisor/sw0p0 traffic
in this case?  Lots of corner cases to consider if we take that
path.

> I had some prototype code at one point that did this; I can dig it
> up if folks think it's useful.
> 
> Also it might be worth noting the "Kernel" net_devices are not
> actually bound to the virtual function but multiplexed/demux'd
> over the physical function pf0 in the diagram. The diagram might
> be read to imply some PCIe relationship between sw0p3 and vf2.
>
> >    +-----+
> >    | vm0 |
> >    | eth0|
> >    +--|--+
> >    |sw0p1|
> >    +--|--+
> >    | vf0 |
> > +----|----+
> > | +--+--+ |
> > | | VEB | |
> > | +-----+ |
> > +---------+
> >
> > It's unclear to me when traffic egressing the VEB should terminate at
> sw0p1 vs.
> > vm0's eth0.  They both represent the same MAC/VLAN.  Similarly, for
> traffic
> > egressing vm0's eth0, when should it terminate at sw0p1 vs. the VEB.
> >
> > Can anyone offer an alternate diagram for switchdev on an SR-IOV NIC?
> >
> 
> One approach would be to treat it like the switch case, where instead
> of a physical port you have a VF. In this case, if you xmit a packet on
> sw0p1 it is sent to eth0. Then if vm0 (eth0) xmits a packet, it enters
> the VEB. The only way to get packets onto sw0p1 is to use a rule to
> either "trap" or "mirror" packets to the "CPU sw0p1 port". Maybe a
> better name would be "hypervisor sw0p1 port". This would be analogous
> to the switch case; I have experimented with adding this support to
> the Flow API I'm working on but have not implemented it on rocker yet.

(Drawing slightly modified to match text above.)

>   +-----+      +-----+
>   |hyper|      | vm0 |
>   |visor|      | eth0|
>   +-----+      +-----+
>      |            |
>   +--|--+      +--|--+
>   |sw0p0|      |sw0p1|
>   +-----+      +-----+
>      |           |
>   +--|-----|-----|-----|--+
>   | ++-----+-----+-----++ |
>   | | VEB               | |
>   | +------------+------+ |
>   | SR-IOV NIC   |        |
>   +--------------|--------+
>                   |
>                  PHY
> 

This was my thought as well, but there's no real hardware connection
between sw0p1 and vm0/eth0, so I don't see how to forward frames across
that conceptual sw0p1<->eth0 interface.  It seems like a different
hardware interface is required:

+-----+ +-----+
|hyper| | vm0 |
|visor| | eth0|
+--|--+ +--|--+
|sw0p1|    |
+--|--+ +--|--+
| vf0'| | vf0 |
+--|--+-+--|--+
| ++-------++ |
| |   VEB   | |
| +---------+ |
+-------------+

When the VEB is handling L2 forwarding, packets entering the VEB with
vm0/eth0's MAC/VLAN would be forwarded to the vf0 interface and land
in the VM, even packets transmitted on sw0p1.

Things become more interesting if the VEB were to implement an OpenFlow
forwarding model.  In that case, a packet that enters the VEB with
vm0/eth0's MAC/VLAN would be forwarded to the vf0 interface on a flow
hit and forwarded to the vf0' interface on a flow miss.
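[Editor's toy model of that hit/miss behaviour, with invented addresses:
the VEB consults a flow table; a hit steers the packet to vf0, a miss
falls back to vf0', the hypothetical host-facing twin interface.]

```python
# Invented flow table keyed on (dst MAC, VLAN); values are egress VFs.
flows = {
    ("52:54:00:12:34:56", 100): "vf0",   # vm0/eth0 (made-up MAC), VLAN 100
}

def veb_forward(dst_mac, vlan):
    # Hit -> the VM's VF; miss -> vf0', the host-side twin of vf0.
    return flows.get((dst_mac, vlan), "vf0'")
```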

Perhaps another way to draw it, more in line with your comment about
pf0 above, would look like this:

+-------------+ +-----+
|hyper        | | vm0 |
|visor        | | eth0|
+--|--+-+--|--+ +--|--+
| eth0| |sw0p1|    |
+--|--+ +--|--+ +--|--+
| pf0 |----+    | vf0 |
+--|--+---------+--|--+
| ++---------------++ |
| |       VEB       | |
| +-----------------+ |
+---------------------+


> here the link between sw0p2 and vm1 is a virtual function instead of a
> physical wire. And sw0p0 is the "CPU port" directly to the hypervisor.
> 
> Is that at all clear? Let me know I can try to do a better write up
> in the AM.

I think we need to decide what the relationship is between the VEB and 
the host bridging functions before we can settle on a topology.

1) Treat the VEB and host bridge/OVS as two separate switches and connect
   them through an uplink (pf0/sw0p0?).

Advantages: Fewer "phantom" devices in the design; works with more existing
            devices.
Disadvantages: Loss of metadata, such as the VEB ingress port.

2) Treat the VEB and host bridge/OVS as a stacked switch.

Advantages: Simplified presentation to the kernel.
Disadvantages: More complex NIC/VEB design; definition of a stacking
               wire protocol to pass metadata.

3) Other options?

Dave

