Hi all,

It's been a while since the last discussion here. I have been working on implementing the standby feature in QEMU. I tried multiple approaches and in the end decided to implement it using the hotplug/unplug infrastructure, for multiple reasons which I'll go over when I send the patches. For now you can find the implementation here:

https://github.com/sameehj/qemu/tree/failover_hidden_opts

(the full command line I used can be found at the end of the email)

I have tested my implementation in QEMU with a Fedora 29 guest: I can see the failover interface successfully and assign an IP to it. The feature is acked and the primary device is plugged in with no issues.

I have created a setup with two hosts (host A and host B) whose X710 10G cards are connected back to back. On one host (host A) I have configured a bridge with both the PF interface and virtio-net's (standby) tap interface attached to it. I ran the guest with the patched QEMU on host A and pinged the bridge successfully, and I also have a working ping between host A and host B; however, I can't ping host B from the VM and vice versa. This only happens when the feature is enabled, for a reason I have yet to figure out. I haven't tested migration yet, but I am on my way to do so.
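For reference, the bridge setup on host A looks roughly like the following. This is only a sketch rather than the exact script I used: the PF name (enp1s0f0) and the bridge name/address are placeholders, while the tap name matches the ifname= from the command line at the end of the email.

  # create the bridge and attach the X710 PF to it (PF name is illustrative)
  ip link add name br0 type bridge
  ip link set enp1s0f0 master br0
  ip link set enp1s0f0 up
  ip link set br0 up
  # give the bridge the address the guest pings/iperfs against below
  # (assuming 192.168.1.117 lives on the bridge)
  ip addr add 192.168.1.117/24 dev br0
  # the tap created by QEMU (ifname=cc1_72) is attached to the same bridge;
  # conceptually this is what test_bridge_standalone.sh does for it
  ip link set cc1_72 master br0
  ip link set cc1_72 up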
Since I couldn't ping from the VM to host B, I did an iperf test between the VM and host A with the feature enabled, and during the test I unplugged the SR-IOV device. The device was unplugged successfully and no drops were observed, as you can see in the results below:

[root@dhcp156-44 ~]# ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.19.156.44  netmask 255.255.248.0  broadcast 10.19.159.255
        inet6 fe80::d306:561f:9f43:ff77  prefixlen 64  scopeid 0x20<link>
        inet6 2620:52:0:1398:9699:325b:25f9:e7bb  prefixlen 64  scopeid 0x0<global>
        ether 56:cc:c1:01:cc:21  txqueuelen 1000  (Ethernet)
        RX packets 12258  bytes 870822 (850.4 KiB)
        RX errors 11  dropped 0  overruns 0  frame 11
        TX packets 294  bytes 32432 (31.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.17  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::bc87:86b8:bc86:be4e  prefixlen 64  scopeid 0x20<link>
        ether 8a:f7:20:29:3b:cb  txqueuelen 1000  (Ethernet)
        RX packets 41052  bytes 2775833 (2.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 47468  bytes 15629 (15.2 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 8a:f7:20:29:3b:cb  txqueuelen 1000  (Ethernet)
        RX packets 214  bytes 14966 (14.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 163  bytes 26498 (25.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens4nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 8a:f7:20:29:3b:cb  txqueuelen 1000  (Ethernet)
        RX packets 41052  bytes 2775833 (2.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 47468  bytes 2889827541 (2.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 176  bytes 19712 (19.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 176  bytes 19712 (19.2 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@dhcp156-44 ~]# iperf -c 192.168.1.117 -t 100 -i 1
------------------------------------------------------------
Client connecting to 192.168.1.117, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.17 port 40368 connected with 192.168.1.117 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  3.47 GBytes  29.8 Gbits/sec
[  3]  1.0- 2.0 sec  4.35 GBytes  37.4 Gbits/sec
[  3]  2.0- 3.0 sec  4.10 GBytes  35.2 Gbits/sec
[  3]  3.0- 4.0 sec  4.20 GBytes  36.1 Gbits/sec
[  3]  4.0- 5.0 sec  4.20 GBytes  36.1 Gbits/sec
[  3]  5.0- 6.0 sec  4.07 GBytes  34.9 Gbits/sec
[  3]  6.0- 7.0 sec  4.53 GBytes  38.9 Gbits/sec
[  3]  7.0- 8.0 sec  4.38 GBytes  37.6 Gbits/sec
[  3]  8.0- 9.0 sec  4.60 GBytes  39.5 Gbits/sec
[  3]  9.0-10.0 sec  4.60 GBytes  39.5 Gbits/sec
[  3] 10.0-11.0 sec  4.56 GBytes  39.2 Gbits/sec
[  3] 11.0-12.0 sec  4.70 GBytes  40.4 Gbits/sec
[  3] 12.0-13.0 sec  4.65 GBytes  39.9 Gbits/sec
[  3] 13.0-14.0 sec  4.51 GBytes  38.7 Gbits/sec
[  3] 14.0-15.0 sec  4.48 GBytes  38.5 Gbits/sec
[  3] 15.0-16.0 sec  4.67 GBytes  40.2 Gbits/sec
[  3] 16.0-17.0 sec  4.37 GBytes  37.5 Gbits/sec
[  3] 17.0-18.0 sec  4.68 GBytes  40.2 Gbits/sec
[  3] 18.0-19.0 sec  4.99 GBytes  42.9 Gbits/sec
[  3] 19.0-20.0 sec  5.00 GBytes  42.9 Gbits/sec
[  3] 20.0-21.0 sec  4.90 GBytes  42.1 Gbits/sec
[  3] 21.0-22.0 sec  4.72 GBytes  40.5 Gbits/sec
[  3] 22.0-23.0 sec  4.60 GBytes  39.5 Gbits/sec
[  3] 23.0-24.0 sec  4.72 GBytes  40.6 Gbits/sec
[  3] 24.0-25.0 sec  4.42 GBytes  38.0 Gbits/sec
[  3] 25.0-26.0 sec  4.44 GBytes  38.2 Gbits/sec
[  3] 26.0-27.0 sec  4.18 GBytes  35.9 Gbits/sec
[  3] 27.0-28.0 sec  4.20 GBytes  36.1 Gbits/sec
[  3] 28.0-29.0 sec  4.27 GBytes  36.7 Gbits/sec
[  3] 29.0-30.0 sec  4.16 GBytes  35.7 Gbits/sec
[  3] 30.0-31.0 sec  4.14 GBytes  35.6 Gbits/sec
[  3] 31.0-32.0 sec  4.13 GBytes  35.4 Gbits/sec
[  3] 32.0-33.0 sec  4.16 GBytes  35.7 Gbits/sec
[  3] 33.0-34.0 sec  4.33 GBytes  37.2 Gbits/sec
[  3] 34.0-35.0 sec  4.31 GBytes  37.0 Gbits/sec
[  3] 35.0-36.0 sec  4.26 GBytes  36.6 Gbits/sec
[  3] 36.0-37.0 sec  4.36 GBytes  37.5 Gbits/sec
[  3] 37.0-38.0 sec  4.11 GBytes  35.3 Gbits/sec
[  3] 38.0-39.0 sec  4.00 GBytes  34.4 Gbits/sec
[  3] 39.0-40.0 sec  4.53 GBytes  38.9 Gbits/sec
[  3] 40.0-41.0 sec  4.06 GBytes  34.9 Gbits/sec
[  3] 41.0-42.0 sec  4.17 GBytes  35.8 Gbits/sec
[  3] 42.0-43.0 sec  4.14 GBytes  35.6 Gbits/sec
[  3] 43.0-44.0 sec  4.07 GBytes  34.9 Gbits/sec
^C[  3]  0.0-44.5 sec   195 GBytes  37.5 Gbits/sec

[root@dhcp156-44 ~]# ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.19.156.44  netmask 255.255.248.0  broadcast 10.19.159.255
        inet6 fe80::d306:561f:9f43:ff77  prefixlen 64  scopeid 0x20<link>
        inet6 2620:52:0:1398:9699:325b:25f9:e7bb  prefixlen 64  scopeid 0x0<global>
        ether 56:cc:c1:01:cc:21  txqueuelen 1000  (Ethernet)
        RX packets 12547  bytes 889713 (868.8 KiB)
        RX errors 11  dropped 0  overruns 0  frame 11
        TX packets 373  bytes 45723 (44.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.17  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::bc87:86b8:bc86:be4e  prefixlen 64  scopeid 0x20<link>
        ether 8a:f7:20:29:3b:cb  txqueuelen 1000  (Ethernet)
        RX packets 2862498  bytes 192898865 (183.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3414905  bytes 209192841687 (194.8 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens4nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 8a:f7:20:29:3b:cb  txqueuelen 1000  (Ethernet)
        RX packets 2862498  bytes 192898865 (183.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3414905  bytes 212082653599 (197.5 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 176  bytes 19712 (19.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 176  bytes 19712 (19.2 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
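A note on the guest-side view above: ens4 is the failover master interface holding the IP, ens4nsby is the virtio-net standby slave (hence the "nsby" suffix), and ens6 is the VF; all three share the MAC 8a:f7:20:29:3b:cb, and ens6 is gone in the second ifconfig because it was hot-unplugged mid-test while traffic kept flowing over the standby path.

For reference, the unplug is done from the QEMU monitor (HMP on stdio in the command line below), roughly:

  (qemu) device_del cc1_71

where cc1_71 is the id of the vfio-pci (primary) device. Plugging the VF back in, e.g. on a migration destination, would be the reverse, along the lines of:

  (qemu) device_add vfio-pci,host=65:02.1,id=cc1_71,standby=cc1_72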
__________________________________________________________________________________________________________________

The command line I used:

/root/qemu/x86_64-softmmu/qemu-system-x86_64 \
 -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc17 \
 -device e1000,netdev=hostnet0,mac=56:cc:c1:01:cc:21,id=cc17 \
 -netdev tap,vhost=on,id=hostnet1,script=test_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
 -device virtio-net,host_mtu=1500,netdev=hostnet1,mac=8a:f7:20:29:3b:cb,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
 -device vfio-pci,host=65:02.1,id=cc1_71,standby=cc1_72 \
 -enable-kvm \
 -name netkvm \
 -m 3000M \
 -drive file=/dev/shm/fedora_29.qcow2,if=ide,id=drivex \
 -smp 4 \
 -vga qxl \
 -spice port=6110,disable-ticketing \
 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x7 \
 -chardev spicevmc,name=vdagent,id=vdagent \
 -device virtserialport,nr=1,bus=virtio-serial0.0,chardev=vdagent,name=com.redhat.spice.0 \
 -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
 -device virtio-serial \
 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 \
 -monitor stdio
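The failover-specific pieces in the command line above are the primary=/standby= properties, which pair the two devices by id (these property names are what the failover_hidden_opts branch currently uses and may still change when I post the patches):

  # virtio-net is the standby device; primary= names the VF's device id
  -device virtio-net,...,id=cc1_72,primary=cc1_71
  # the assigned VF is the primary device; standby= names the virtio-net id
  -device vfio-pci,host=65:02.1,id=cc1_71,standby=cc1_72

Only the virtio-net device is given mac=8a:f7:20:29:3b:cb on the command line, yet the VF shows up in the guest with the same MAC (see the ifconfig output above); that is normally arranged on the host side through the PF, e.g. with the kind of "ip link set <PF> vf <N> mac 8a:f7:20:29:3b:cb" operation discussed in the thread below.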
On Fri, Oct 19, 2018 at 6:45 AM Michael S. Tsirkin wrote:
> On Wed, Oct 10, 2018 at 06:26:50PM -0700, Siwei Liu wrote:
> > On Fri, Oct 5, 2018 at 12:18 PM Michael S. Tsirkin wrote:
> > >
> > > On Thu, Oct 04, 2018 at 05:03:14PM -0700, Siwei Liu wrote:
> > > > On Tue, Oct 2, 2018 at 5:43 AM Michael S. Tsirkin wrote:
> > > > >
> > > > > On Tue, Oct 02, 2018 at 01:42:09AM -0700, Siwei Liu wrote:
> > > > > > The VF's MAC can be updated by PF/host on the fly at any time.
> > > > > > One can start with a random MAC but use group ID to pair the
> > > > > > device instead, and only update the MAC address to the real one
> > > > > > when moving the MAC filter around after PV says OK to switch
> > > > > > datapath.
> > > > > >
> > > > > > Do you see any problem with this design?
> > > > >
> > > > > Isn't this what I proposed:
> > > > >     Maybe we can start VF with a temporary MAC, then change it
> > > > >     to a final one when guest tries to use it. It will work but
> > > > >     we run into fact that MACs are currently programmed by
> > > > >     mgmnt - in many setups qemu does not have the rights to do it.
> > > > > ?
> > > > >
> > > > > If yes I don't see a problem with the interface design, even
> > > > > though implementation wise it's more work as it will have to
> > > > > include management changes.
> > > >
> > > > I thought we discussed this design a while back:
> > > > https://www.spinics.net/lists/netdev/msg512232.html
> > > >
> > > > ... plug in a VF with a random MAC filter programmed in prior, and
> > > > initially use that random MAC within the guest. This would require:
> > > > a) not relying on the permanent MAC address to do pairing during
> > > >    the initial discovery, e.g. use the failover group ID as in this
> > > >    discussion
> > > > b) the host to toggle the MAC address filter, which includes taking
> > > >    down the tap device to return the MAC back to the PF, followed
> > > >    by assigning that MAC to the VF using "ip link ... set vf ..."
> > > > c) notifying the guest to reload/reset the VF driver for the change
> > > >    of hardware MAC address
> > > > d) until the VF reloads the driver it won't be able to use the
> > > >    datapath, so a very short period of network outage is (still)
> > > >    expected
> > > >
> > > > though I still don't think this design can eliminate downtime.
> > >
> > > No, my idea is somewhat different. As you say there is a problem
> > > of delay at point (c).
> >
> > That's true, I never said the downtime can be avoided, because of this
> > delay on the guest side. But with this the downtime gets to the bare
> > minimum, and in most situations packets won't be lost on reception as
> > long as the PF sets up the filter in a timely manner.
>
> It's not really the bare minimum IMHO. E.g. fixing the PF to
> defer the filter update will give you less downtime.
>
> > > Further, the need to poke at PF filters with set vf does not match
> > > the current security model where any security related configuration
> > > such as MAC filtering is done upfront.
> >
> > The security model belongs to the VM policy not the VF, right? I think
> > the same MAC address will always be used on the VM as it starts with
> > virtio. Why is it a security issue that the VF starts with an unused
> > MAC before it's able to be used in the guest?
>
> Basically if the guest is able to trigger MAC changes,
> it might be able to exploit some bug to escalate that to
> full network access. Completely blocking configuration
> changes after setup feels safer.
>
> Case in point, with QEMU a typical selinux policy will block
> attempts to change MACs; that task will have to be
> delegated to a suitably privileged tool.
>
> > > So I have two suggestions:
> > >
> > > 1. Teach the PF driver not to program the filter until the VF driver
> > >    actually goes up.
> > >
> > > How do we know it went up? For example, it is highly likely
> > > that the driver will send some kind of command on init.
> > > E.g. Linux seems to always try to set the mac address during init.
> > > We can have any kind of command received by the PF enable
> > > the filter, until reset.
> >
> > I'm not sure it's a valid assumption for any guest, say Windows. The
> > VF can start with the MAC address advertised by the PF at the first
> > reset, and the MAC filter generally will be activated at that point.
> > Some other PF/VF variants don't enable the filter until the VF is
> > brought up in the guest, while some others enable the filter even
> > before the VF gets assigned to the guest. Trying to assume the
> > behaviour of a specific guest or a specific NIC device is a slippery
> > slope.
>
> Is all this just theoretical or do you observe any problems in practice?
>
> > The only thing that's reliable is the semantics of the ndo_vf_xxx
> > interface for the PF.
>
> ndo_vf_xxx is an internal Linux interface. That's not guaranteed to be
> stable at all. I think you mean the netlink interface that triggers
> that. That should be stable, but if what you say above is true it isn't
> fully defined.
>
> > You seem to assume too much about the specific PF behaviour, which is
> > not defined in the interface itself.
>
> So IMHO it's something that we should fix in Linux,
> making all devices behave consistently.
>
> > > In absence of an appropriate command, QEMU can detect bus master
> > > enable and do that.
> > >
> > > 2. Create a variant of trusted VF where it starts out without a valid
> > >    MAC; the guest can set a softmac MAC but can only set it to the
> > >    specific value that matches virtio.
> > >    Alternatively - if it's preferred for some reason - allow the
> > >    guest to program just two MACs, the original one and the virtio
> > >    one. Any other value is denied.
> >
> > I am getting confused, I don't know why that's even needed. The
> > management tool can set any predefined MAC that is deemed safe for the
> > VF to start with. Why does it need to be that complicated? What is the
> > purpose of another model for trusted VF and softmac? It's the PF that
> > changes the MAC, not the VF.
>
> This will give us a simple solution, without guest driver changes, for
> when the VF is trusted. In particular it will work e.g. for PFs as well.
>
> > > > However, it looks like as of today the MAC matching still hasn't
> > > > addressed the datapath switching and error handling in a clean
> > > > way. As said, for SR-IOV live migration on an iSCSI root disk
> > > > there will be a lot of dancing parts going along the way; reliable
> > > > network connectivity and dedicated handshakes are critical to this
> > > > kind of setup.
> > > >
> > > > -Siwei
> > >
> > > I think MAC matching removes downtime when the device is removed but
> > > not when it's re-added, yes. It has the advantage of already present
> > > Linux driver support, but if you are prepared to work on adding e.g.
> > > bridge based matching, that will go away.
> >
> > The removal order and consequence will be the same between MAC
> > matching and group ID based matching. It's just the initial discovery
> > that's slightly different. Why do you think the downtime will be
> > different for the removal scenario? And why do you think it's needed
> > to alter the current PF driver behavior to support bridge based
> > matching? Sorry, I'm really confused about your suggestion. Those PF
> > driver model changes are not actually needed. The fact is that bridge
> > based matching is supposed to work quite well for any PF driver
> > implementation, no matter when the MAC address filters get added or
> > enabled.
> >
> > Thanks,
> > -Siwei
>
> It seems that it requires a bunch of changes for all VF drivers though.

--
Respectfully,
Sameeh Jubran
Software Engineer @ Daynix