All of lore.kernel.org
 help / color / mirror / Atom feed
* [summary] virtio network device failover writeup
@ 2019-03-17 13:55 Michael S. Tsirkin
  2019-03-19 12:38 ` Liran Alon
  2019-03-19 12:38 ` Liran Alon
  0 siblings, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-17 13:55 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Sridhar Samudrala, Alexander Duyck, Stephen Hemminger,
	Jakub Kicinski, Jiri Pirko, David Miller, Netdev, virtualization,
	liran.alon, boris.ostrovsky, vijay.balakrishna, jfreimann,
	ogerlitz, vuhuong

Hi all,
I've put up a blog post with a summary of where network
device failover stands and some open issues.
Not sure where best to host it, I just put it up on blogspot:
https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html

Comments, corrections are welcome!

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-17 13:55 [summary] virtio network device failover writeup Michael S. Tsirkin
  2019-03-19 12:38 ` Liran Alon
@ 2019-03-19 12:38 ` Liran Alon
  2019-03-19 15:46   ` Stephen Hemminger
                     ` (4 more replies)
  1 sibling, 5 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-19 12:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
	Stephen Hemminger, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

Hi Michael,

Great blog-post which summarise everything very well!

Some comments I have:

1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.

2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
Which leads me to the next comment.

3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
The issues that I’m aware of are (Please tell me if you are aware of others!):
(a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
(b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
(I don’t think Microsoft has handled this)
b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
(Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)

4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.
To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.

P.S:
I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.

Regards,
-Liran

> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
> 
> Comments, corrections are welcome!
> 
> -- 
> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-17 13:55 [summary] virtio network device failover writeup Michael S. Tsirkin
@ 2019-03-19 12:38 ` Liran Alon
  2019-03-19 12:38 ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-19 12:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

Hi Michael,

Great blog-post which summarise everything very well!

Some comments I have:

1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.

2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
Which leads me to the next comment.

3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
The issues that I’m aware of are (Please tell me if you are aware of others!):
(a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
(b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
(I don’t think Microsoft has handled this)
b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
(Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)

4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.
To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.

P.S:
I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.

Regards,
-Liran

> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
> 
> Comments, corrections are welcome!
> 
> -- 
> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 12:38 ` Liran Alon
  2019-03-19 15:46   ` Stephen Hemminger
@ 2019-03-19 15:46   ` Stephen Hemminger
  2019-03-19 21:19     ` Michael S. Tsirkin
  2019-03-19 21:19     ` Michael S. Tsirkin
  2019-03-19 21:06   ` Michael S. Tsirkin
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 62+ messages in thread
From: Stephen Hemminger @ 2019-03-19 15:46 UTC (permalink / raw)
  To: Liran Alon
  Cc: Michael S. Tsirkin, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Tue, 19 Mar 2019 14:38:06 +0200
Liran Alon <liran.alon@oracle.com> wrote:

> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).

Cloud-init should really just ignore all devices that have a master device.
That would have been more general, and safer for other use cases.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 12:38 ` Liran Alon
@ 2019-03-19 15:46   ` Stephen Hemminger
  2019-03-19 15:46   ` Stephen Hemminger
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 62+ messages in thread
From: Stephen Hemminger @ 2019-03-19 15:46 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski,
	Sridhar Samudrala, Alexander Duyck, virtualization, Netdev,
	Si-Wei Liu, boris.ostrovsky, David Miller, ogerlitz

On Tue, 19 Mar 2019 14:38:06 +0200
Liran Alon <liran.alon@oracle.com> wrote:

> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).

Cloud-init should really just ignore all devices that have a master device.
That would have been more general, and safer for other use cases.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 12:38 ` Liran Alon
  2019-03-19 15:46   ` Stephen Hemminger
  2019-03-19 15:46   ` Stephen Hemminger
@ 2019-03-19 21:06   ` Michael S. Tsirkin
  2019-03-19 23:05     ` Liran Alon
  2019-03-19 23:05     ` Liran Alon
  2019-03-19 21:06   ` Michael S. Tsirkin
  2019-03-19 21:55   ` si-wei liu
  4 siblings, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-19 21:06 UTC (permalink / raw)
  To: Liran Alon
  Cc: Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
	Stephen Hemminger, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
> Hi Michael,
> 
> Great blog-post which summarise everything very well!
> 
> Some comments I have:

Thanks!
I'll try to update everything in the post when I'm not so jet-lagged.

> 1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
> Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.
> 
> 2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
> The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
> Which leads me to the next comment.
> 
> 3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
> The issues that I’m aware of are (Please tell me if you are aware of others!):
> (a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
> The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
> The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
> (b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
> b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
> (I don’t think Microsoft has handled this)
> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
> (Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
> 
> 4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
> In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.

In this setup, how does VF keep going? If it doesn't keep going, why is
it helpful?

> To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.
> 
> P.S:
> I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.
> 
> Regards,
> -Liran
> 
> > On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > Hi all,
> > I've put up a blog post with a summary of where network
> > device failover stands and some open issues.
> > Not sure where best to host it, I just put it up on blogspot:
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
> > 
> > Comments, corrections are welcome!
> > 
> > -- 
> > MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 12:38 ` Liran Alon
                     ` (2 preceding siblings ...)
  2019-03-19 21:06   ` Michael S. Tsirkin
@ 2019-03-19 21:06   ` Michael S. Tsirkin
  2019-03-19 21:55   ` si-wei liu
  4 siblings, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-19 21:06 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
> Hi Michael,
> 
> Great blog-post which summarise everything very well!
> 
> Some comments I have:

Thanks!
I'll try to update everything in the post when I'm not so jet-lagged.

> 1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
> Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.
> 
> 2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
> The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
> Which leads me to the next comment.
> 
> 3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
> The issues that I’m aware of are (Please tell me if you are aware of others!):
> (a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
> The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
> The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
> (b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
> b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
> (I don’t think Microsoft has handled this)
> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
> (Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
> 
> 4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
> In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.

In this setup, how does VF keep going? If it doesn't keep going, why is
it helpful?

> To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.
> 
> P.S:
> I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.
> 
> Regards,
> -Liran
> 
> > On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > Hi all,
> > I've put up a blog post with a summary of where network
> > device failover stands and some open issues.
> > Not sure where best to host it, I just put it up on blogspot:
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
> > 
> > Comments, corrections are welcome!
> > 
> > -- 
> > MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 15:46   ` Stephen Hemminger
  2019-03-19 21:19     ` Michael S. Tsirkin
@ 2019-03-19 21:19     ` Michael S. Tsirkin
  2019-03-19 23:25       ` Liran Alon
  2019-03-19 23:25       ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-19 21:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Liran Alon, Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
	Jakub Kicinski, Jiri Pirko, David Miller, Netdev, virtualization,
	boris.ostrovsky, vijay.balakrishna, jfreimann, ogerlitz, vuhuong

On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> On Tue, 19 Mar 2019 14:38:06 +0200
> Liran Alon <liran.alon@oracle.com> wrote:
> 
> > b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> 
> Cloud-init should really just ignore all devices that have a master device.
> That would have been more general, and safer for other use cases.

Given lots of userspace doesn't do this, I wonder whether it would be
safer to just somehow pretend to userspace that the slave links are
down? And add a special attribute for the actual link state.

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 15:46   ` Stephen Hemminger
@ 2019-03-19 21:19     ` Michael S. Tsirkin
  2019-03-19 21:19     ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-19 21:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Liran Alon, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> On Tue, 19 Mar 2019 14:38:06 +0200
> Liran Alon <liran.alon@oracle.com> wrote:
> 
> > b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> 
> Cloud-init should really just ignore all devices that have a master device.
> That would have been more general, and safer for other use cases.

Given lots of userspace doesn't do this, I wonder whether it would be
safer to just somehow pretend to userspace that the slave links are
down? And add a special attribute for the actual link state.

-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 12:38 ` Liran Alon
                     ` (3 preceding siblings ...)
  2019-03-19 21:06   ` Michael S. Tsirkin
@ 2019-03-19 21:55   ` si-wei liu
  4 siblings, 0 replies; 62+ messages in thread
From: si-wei liu @ 2019-03-19 21:55 UTC (permalink / raw)
  To: Liran Alon, Michael S. Tsirkin
  Cc: Sridhar Samudrala, Alexander Duyck, Stephen Hemminger,
	Jakub Kicinski, Jiri Pirko, David Miller, Netdev, virtualization,
	boris.ostrovsky, vijay.balakrishna, jfreimann, ogerlitz, vuhuong



On 3/19/2019 5:38 AM, Liran Alon wrote:
> Hi Michael,
>
> Great blog-post which summarise everything very well!
>
> Some comments I have:
>
> 1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
> Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.
>
> 2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
> The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
> Which leads me to the next comment.
>
> 3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
> The issues that I’m aware of are (Please tell me if you are aware of others!):
> (a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
> The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
> The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
> (b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
> b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
> (I don’t think Microsoft has handled this)
> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
> (Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
Add one additional issue that was just uncovered:
b.5) netplan: 3-netdev confused Ubuntu's netplan tool which dynamically 
generates udev rules in /run/udev/rules.d on the fly that matches netdev 
by MAC address only. I will file an enhancement request on launchpad later.

-Siwei

>
> 4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
> In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.
> To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.
>
> P.S:
> I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.
>
> Regards,
> -Liran
>
>> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> Hi all,
>> I've put up a blog post with a summary of where network
>> device failover stands and some open issues.
>> Not sure where best to host it, I just put it up on blogspot:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
>>
>> Comments, corrections are welcome!
>>
>> -- 
>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 21:06   ` Michael S. Tsirkin
@ 2019-03-19 23:05     ` Liran Alon
  2019-03-19 23:05     ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-19 23:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
	Stephen Hemminger, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 19 Mar 2019, at 23:06, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
>> Hi Michael,
>> 
>> Great blog-post which summarise everything very well!
>> 
>> Some comments I have:
> 
> Thanks!
> I'll try to update everything in the post when I'm not so jet-lagged.
> 
>> 1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
>> Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.
>> 
>> 2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
>> The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
>> Which leads me to the next comment.
>> 
>> 3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
>> The issues that I’m aware of are (Please tell me if you are aware of others!):
>> (a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
>> The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
>> The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
>> (b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
>> b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
>> (I don’t think Microsoft has handled this)
>> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
>> (Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>> b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
>> 
>> 4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
>> In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.
> 
> In this setup, how does VF keep going? If it doesn't keep going, why is
> it helpful?

Let me attempt to clarify.

First, let’s analyse what can a cloud provider do when it wishes to upgrade the NIC firmware when there are currently running guests utilising SR-IOV.
He can perform the following operations in order:
1) Hot-unplug all VFs from all running guests.
2) Upgrade NIC firmware. Will result in PCIe reset which will cause momentary network down-time on PF but immediately afterwards PF will be set up again and guests will have network connectivity.
3) Provision and hot-plug new VFs for all running guests. Guests again have accelerated networking.

Without the net-failover mechanism, host will have to hot-unplug all VFs from all running guests and provision new VFs and hot-plug them anyway. But in that case, the network down-time for guests is longer.

Second, let’s analyse what will happen when health service running on host notice that NIC firmware is in a bad state and therefore NIC should be reset to recover.
The health service can take exactly the same order of operations as described above besides (2) which will just become a PCIe reset.
Again, guests have shorter network down-time in this case as-well when utilising the net-failover mechanism.

> 
>> To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.
>> 
>> P.S:
>> I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.
>> 
>> Regards,
>> -Liran
>> 
>>> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> Hi all,
>>> I've put up a blog post with a summary of where network
>>> device failover stands and some open issues.
>>> Not sure where best to host it, I just put it up on blogspot:
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
>>> 
>>> Comments, corrections are welcome!
>>> 
>>> -- 
>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 21:06   ` Michael S. Tsirkin
  2019-03-19 23:05     ` Liran Alon
@ 2019-03-19 23:05     ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-19 23:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 19 Mar 2019, at 23:06, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
>> Hi Michael,
>> 
>> Great blog-post which summarise everything very well!
>> 
>> Some comments I have:
> 
> Thanks!
> I'll try to update everything in the post when I'm not so jet-lagged.
> 
>> 1) I think that when we are using the term “1-netdev model” on community discussion, we tend to refer to what you have defined in blog-post as "3-device model with hidden slaves”.
>> Therefore, I would suggest to just remove the “1-netdev model” section and rename the "3-device model with hidden slaves” section to “1-netdev model”.
>> 
>> 2) The userspace issues result both from using “2-netdev model” and “3-netdev model”. However, they are described in blog-post as they only exist on “3-netdev model”.
>> The reason these issues are not seen in Azure environment is because these issues were partially handled by Microsoft for their specific 2-netdev model.
>> Which leads me to the next comment.
>> 
>> 3) I suggest that blog-post will also elaborate on what exactly are the userspace issues which results in models different than “1-netdev model”.
>> The issues that I’m aware of are (Please tell me if you are aware of others!):
>> (a) udev rename race-condition: When net-failover device is opened, it also opens it's slaves. However, the order of events to udev on KOBJ_ADD is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace will respond to first event by open the net-failover, then any attempt of userspace to rename virtio-net netdev as a response to the second event will fail because the virtio-net netdev is already opened. Also note that this udev rename rule is useful because we would like to add rules that renames virtio-net netdev to clearly signal that it’s used as the standby interface of another net-failover netdev.
>> The way this problem was workaround by Microsoft in NetVSC is to delay the open done on slave-VF from the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by community only because it’s internal to the NetVSC driver. However, similar solution was rejected by community for the net-failover driver.
>> The solution that we currently proposed to address this (Patch by Si-Wei) was to change the rename kernel handling to allow a net-failover slave to be renamed even if it is already opened. Patch is still not accepted.
>> (b) Issues caused because of various userspace components DHCP the net-failover slaves: DHCP of course should only be done on the net-failover netdev. Attempting to DHCP on net-failover slaves as-well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
>> b.1) dhclient: If run without parameters, it by default just enum all netdevs and attempt to DHCP them all.
>> (I don’t think Microsoft has handled this)
>> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components needs networking and therefore DHCP on all netdevs.
>> (Microsoft haven’t handled (b.2) because they don’t have images which perform iSCSI boot in their Azure setup. Still an open issue)
>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>> b.4) Various distros network-manager need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
>> 
>> 4) Another interesting use-case where the net-failover mechanism is useful is for handling NIC firmware failures or NIC firmware Live-Upgrade.
>> In both cases, there is a need to perform a full PCIe reset of the NIC. Which lose all the NIC eSwitch configuration of the various VFs.
> 
> In this setup, how does VF keep going? If it doesn't keep going, why is
> it helpful?

Let me attempt to clarify.

First, let’s analyse what can a cloud provider do when it wishes to upgrade the NIC firmware when there are currently running guests utilising SR-IOV.
He can perform the following operations in order:
1) Hot-unplug all VFs from all running guests.
2) Upgrade NIC firmware. Will result in PCIe reset which will cause momentary network down-time on PF but immediately afterwards PF will be set up again and guests will have network connectivity.
3) Provision and hot-plug new VFs for all running guests. Guests again have accelerated networking.

Without the net-failover mechanism, host will have to hot-unplug all VFs from all running guests and provision new VFs and hot-plug them anyway. But in that case, the network down-time for guests is longer.

Second, let’s analyse what will happen when health service running on host notice that NIC firmware is in a bad state and therefore NIC should be reset to recover.
The health service can take exactly the same order of operations as described above besides (2) which will just become a PCIe reset.
Again, guests have shorter network down-time in this case as-well when utilising the net-failover mechanism.

> 
>> To handle these cases gracefully, one could just hot-unplug all VFs from guests running on host (which will make all guests now use the virtio-net netdev which is backed by a netdev that eventually is on top of PF). Therefore, networking will be restored to guests once the PCIe reset is completed and the PF is functional again. To re-acceelrate the guests network, hypervisor can just hot-plug new VFs to guests.
>> 
>> P.S:
>> I would very appreciate all this forum help in closing on the pending items written in (3). Which currently prevents using this net-failover mechanism in real production use-cases.
>> 
>> Regards,
>> -Liran
>> 
>>> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> Hi all,
>>> I've put up a blog post with a summary of where network
>>> device failover stands and some open issues.
>>> Not sure where best to host it, I just put it up on blogspot:
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e=
>>> 
>>> Comments, corrections are welcome!
>>> 
>>> -- 
>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 21:19     ` Michael S. Tsirkin
@ 2019-03-19 23:25       ` Liran Alon
  2019-03-20 10:25         ` Michael S. Tsirkin
  2019-03-20 10:25         ` Michael S. Tsirkin
  2019-03-19 23:25       ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-19 23:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon <liran.alon@oracle.com> wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.

I think this may be problematic as it would also break legit use case of userspace attempt to set various config on VF slave.
In general, lying to userspace usually leads to problems. If we reach to a scenario where we try to avoid userspace issues generically and not
on a userspace component basis, I believe the right path should be to hide the net-failover slaves such that explicit action is required
to actually manipulate them (As described in blog-post). E.g. Automatically move net-failover slaves by kernel to a different netns.

-Liran

> 
> -- 
> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 21:19     ` Michael S. Tsirkin
  2019-03-19 23:25       ` Liran Alon
@ 2019-03-19 23:25       ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-19 23:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon <liran.alon@oracle.com> wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.

I think this may be problematic as it would also break legit use case of userspace attempt to set various config on VF slave.
In general, lying to userspace usually leads to problems. If we reach to a scenario where we try to avoid userspace issues generically and not
on a userspace component basis, I believe the right path should be to hide the net-failover slaves such that explicit action is required
to actually manipulate them (As described in blog-post). E.g. Automatically move net-failover slaves by kernel to a different netns.

-Liran

> 
> -- 
> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 23:25       ` Liran Alon
  2019-03-20 10:25         ` Michael S. Tsirkin
@ 2019-03-20 10:25         ` Michael S. Tsirkin
  2019-03-20 12:23           ` Liran Alon
  2019-03-20 12:23           ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-20 10:25 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> 
> 
> > On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >> On Tue, 19 Mar 2019 14:38:06 +0200
> >> Liran Alon <liran.alon@oracle.com> wrote:
> >> 
> >>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >> 
> >> Cloud-init should really just ignore all devices that have a master device.
> >> That would have been more general, and safer for other use cases.
> > 
> > Given lots of userspace doesn't do this, I wonder whether it would be
> > safer to just somehow pretend to userspace that the slave links are
> > down? And add a special attribute for the actual link state.
> 
> I think this may be problematic as it would also break legit use case
> of userspace attempt to set various config on VF slave.
> In general, lying to userspace usually leads to problems.

I hear you on this. So how about instead of lying,
we basically just fail some accesses to slaves
unless a flag is set e.g. in ethtool.

Some userspace will need to change to set it but in a minor way.
Arguably/hopefully failure to set config would generally be a safer
failure.

Which things to fail? Probably sending/receiving packets?  Getting MAC?
More?

> If we reach
> to a scenario where we try to avoid userspace issues generically and
> not on a userspace component basis, I believe the right path should be
> to hide the net-failover slaves such that explicit action is required
> to actually manipulate them (As described in blog-post). E.g.
> Automatically move net-failover slaves by kernel to a different netns.
> 
> -Liran
> 
> > 
> > -- 
> > MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-19 23:25       ` Liran Alon
@ 2019-03-20 10:25         ` Michael S. Tsirkin
  2019-03-20 10:25         ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-20 10:25 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> 
> 
> > On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >> On Tue, 19 Mar 2019 14:38:06 +0200
> >> Liran Alon <liran.alon@oracle.com> wrote:
> >> 
> >>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >> 
> >> Cloud-init should really just ignore all devices that have a master device.
> >> That would have been more general, and safer for other use cases.
> > 
> > Given lots of userspace doesn't do this, I wonder whether it would be
> > safer to just somehow pretend to userspace that the slave links are
> > down? And add a special attribute for the actual link state.
> 
> I think this may be problematic as it would also break legit use case
> of userspace attempt to set various config on VF slave.
> In general, lying to userspace usually leads to problems.

I hear you on this. So how about instead of lying,
we basically just fail some accesses to slaves
unless a flag is set e.g. in ethtool.

Some userspace will need to change to set it but in a minor way.
Arguably/hopefully failure to set config would generally be a safer
failure.

Which things to fail? Probably sending/receiving packets?  Getting MAC?
More?

> If we reach
> to a scenario where we try to avoid userspace issues generically and
> not on a userspace component basis, I believe the right path should be
> to hide the net-failover slaves such that explicit action is required
> to actually manipulate them (As described in blog-post). E.g.
> Automatically move net-failover slaves by kernel to a different netns.
> 
> -Liran
> 
> > 
> > -- 
> > MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 10:25         ` Michael S. Tsirkin
@ 2019-03-20 12:23           ` Liran Alon
  2019-03-20 14:09             ` Michael S. Tsirkin
  2019-03-20 14:09             ` Michael S. Tsirkin
  2019-03-20 12:23           ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-20 12:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>> 
>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>> 
>>>> Cloud-init should really just ignore all devices that have a master device.
>>>> That would have been more general, and safer for other use cases.
>>> 
>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>> safer to just somehow pretend to userspace that the slave links are
>>> down? And add a special attribute for the actual link state.
>> 
>> I think this may be problematic as it would also break legit use case
>> of userspace attempt to set various config on VF slave.
>> In general, lying to userspace usually leads to problems.
> 
> I hear you on this. So how about instead of lying,
> we basically just fail some accesses to slaves
> unless a flag is set e.g. in ethtool.
> 
> Some userspace will need to change to set it but in a minor way.
> Arguably/hopefully failure to set config would generally be a safer
> failure.

Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.

Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
by userspace that it wishes to perform a set of actions on the net-failover slave.

Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
But of course maybe there are other ideas that can achieve similar behaviour.

-Liran

> 
> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> More?
> 
>> If we reach
>> to a scenario where we try to avoid userspace issues generically and
>> not on a userspace component basis, I believe the right path should be
>> to hide the net-failover slaves such that explicit action is required
>> to actually manipulate them (As described in blog-post). E.g.
>> Automatically move net-failover slaves by kernel to a different netns.
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 10:25         ` Michael S. Tsirkin
  2019-03-20 12:23           ` Liran Alon
@ 2019-03-20 12:23           ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-20 12:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>> 
>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>> 
>>>> Cloud-init should really just ignore all devices that have a master device.
>>>> That would have been more general, and safer for other use cases.
>>> 
>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>> safer to just somehow pretend to userspace that the slave links are
>>> down? And add a special attribute for the actual link state.
>> 
>> I think this may be problematic as it would also break legit use case
>> of userspace attempt to set various config on VF slave.
>> In general, lying to userspace usually leads to problems.
> 
> I hear you on this. So how about instead of lying,
> we basically just fail some accesses to slaves
> unless a flag is set e.g. in ethtool.
> 
> Some userspace will need to change to set it but in a minor way.
> Arguably/hopefully failure to set config would generally be a safer
> failure.

Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.

Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
by userspace that it wishes to perform a set of actions on the net-failover slave.

Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
But of course maybe there are other ideas that can achieve similar behaviour.

-Liran

> 
> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> More?
> 
>> If we reach
>> to a scenario where we try to avoid userspace issues generically and
>> not on a userspace component basis, I believe the right path should be
>> to hide the net-failover slaves such that explicit action is required
>> to actually manipulate them (As described in blog-post). E.g.
>> Automatically move net-failover slaves by kernel to a different netns.
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 12:23           ` Liran Alon
  2019-03-20 14:09             ` Michael S. Tsirkin
@ 2019-03-20 14:09             ` Michael S. Tsirkin
  2019-03-20 21:43               ` Liran Alon
  2019-03-20 21:43               ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-20 14:09 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>> Liran Alon <liran.alon@oracle.com> wrote:
> >>>> 
> >>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>> 
> >>>> Cloud-init should really just ignore all devices that have a master device.
> >>>> That would have been more general, and safer for other use cases.
> >>> 
> >>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>> safer to just somehow pretend to userspace that the slave links are
> >>> down? And add a special attribute for the actual link state.
> >> 
> >> I think this may be problematic as it would also break legit use case
> >> of userspace attempt to set various config on VF slave.
> >> In general, lying to userspace usually leads to problems.
> > 
> > I hear you on this. So how about instead of lying,
> > we basically just fail some accesses to slaves
> > unless a flag is set e.g. in ethtool.
> > 
> > Some userspace will need to change to set it but in a minor way.
> > Arguably/hopefully failure to set config would generally be a safer
> > failure.
> 
> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.

Sorry about being unclear, the idea would be to require the flag on each ethtool operation.

> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.

I think sending/receiving should probably just fail unconditionally.

> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> by userspace that it wishes to perform a set of actions on the net-failover slave.
> 
> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> But of course maybe there are other ideas that can achieve similar behaviour.
> 
> -Liran
> 
> > 
> > Which things to fail? Probably sending/receiving packets?  Getting MAC?
> > More?
> > 
> >> If we reach
> >> to a scenario where we try to avoid userspace issues generically and
> >> not on a userspace component basis, I believe the right path should be
> >> to hide the net-failover slaves such that explicit action is required
> >> to actually manipulate them (As described in blog-post). E.g.
> >> Automatically move net-failover slaves by kernel to a different netns.
> >> 
> >> -Liran
> >> 
> >>> 
> >>> -- 
> >>> MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 12:23           ` Liran Alon
@ 2019-03-20 14:09             ` Michael S. Tsirkin
  2019-03-20 14:09             ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-20 14:09 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>> Liran Alon <liran.alon@oracle.com> wrote:
> >>>> 
> >>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>> 
> >>>> Cloud-init should really just ignore all devices that have a master device.
> >>>> That would have been more general, and safer for other use cases.
> >>> 
> >>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>> safer to just somehow pretend to userspace that the slave links are
> >>> down? And add a special attribute for the actual link state.
> >> 
> >> I think this may be problematic as it would also break legit use case
> >> of userspace attempt to set various config on VF slave.
> >> In general, lying to userspace usually leads to problems.
> > 
> > I hear you on this. So how about instead of lying,
> > we basically just fail some accesses to slaves
> > unless a flag is set e.g. in ethtool.
> > 
> > Some userspace will need to change to set it but in a minor way.
> > Arguably/hopefully failure to set config would generally be a safer
> > failure.
> 
> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.

Sorry about being unclear, the idea would be to require the flag on each ethtool operation.

> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.

I think sending/receiving should probably just fail unconditionally.

> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> by userspace that it wishes to perform a set of actions on the net-failover slave.
> 
> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> But of course maybe there are other ideas that can achieve similar behaviour.
> 
> -Liran
> 
> > 
> > Which things to fail? Probably sending/receiving packets?  Getting MAC?
> > More?
> > 
> >> If we reach
> >> to a scenario where we try to avoid userspace issues generically and
> >> not on a userspace component basis, I believe the right path should be
> >> to hide the net-failover slaves such that explicit action is required
> >> to actually manipulate them (As described in blog-post). E.g.
> >> Automatically move net-failover slaves by kernel to a different netns.
> >> 
> >> -Liran
> >> 
> >>> 
> >>> -- 
> >>> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 14:09             ` Michael S. Tsirkin
  2019-03-20 21:43               ` Liran Alon
@ 2019-03-20 21:43               ` Liran Alon
  2019-03-20 22:10                 ` Michael S. Tsirkin
  2019-03-20 22:10                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-20 21:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>>>> 
>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>> 
>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>> That would have been more general, and safer for other use cases.
>>>>> 
>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>> down? And add a special attribute for the actual link state.
>>>> 
>>>> I think this may be problematic as it would also break legit use case
>>>> of userspace attempt to set various config on VF slave.
>>>> In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.

Oh. I have indeed misunderstood your previous email then. :)
Thanks for clarifying.

> 
>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.

You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
unless skb is marked with some flag to indicate it has been sent via the net-failover master?

This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).

However, I see a couple of down-sides to it:
1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.

The set of changes I vision to fix our issues are:
1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
(E.g. Configure the net-failover VF slave in some special way).
2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).

Is there any clear issue we see regarding the above suggestion?

-Liran

> 
>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>> 
>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>> But of course maybe there are other ideas that can achieve similar behaviour.
>> 
>> -Liran
>> 
>>> 
>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
>>> More?
>>> 
>>>> If we reach
>>>> to a scenario where we try to avoid userspace issues generically and
>>>> not on a userspace component basis, I believe the right path should be
>>>> to hide the net-failover slaves such that explicit action is required
>>>> to actually manipulate them (As described in blog-post). E.g.
>>>> Automatically move net-failover slaves by kernel to a different netns.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>> -- 
>>>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 14:09             ` Michael S. Tsirkin
@ 2019-03-20 21:43               ` Liran Alon
  2019-03-20 21:43               ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-20 21:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>>>> 
>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>> 
>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>> That would have been more general, and safer for other use cases.
>>>>> 
>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>> down? And add a special attribute for the actual link state.
>>>> 
>>>> I think this may be problematic as it would also break legit use case
>>>> of userspace attempt to set various config on VF slave.
>>>> In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.

Oh. I have indeed misunderstood your previous email then. :)
Thanks for clarifying.

> 
>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.

You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
unless skb is marked with some flag to indicate it has been sent via the net-failover master?

This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).

However, I see a couple of down-sides to it:
1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.

The set of changes I vision to fix our issues are:
1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
(E.g. Configure the net-failover VF slave in some special way).
2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).

Is there any clear issue we see regarding the above suggestion?

-Liran

> 
>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>> 
>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>> But of course maybe there are other ideas that can achieve similar behaviour.
>> 
>> -Liran
>> 
>>> 
>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
>>> More?
>>> 
>>>> If we reach
>>>> to a scenario where we try to avoid userspace issues generically and
>>>> not on a userspace component basis, I believe the right path should be
>>>> to hide the net-failover slaves such that explicit action is required
>>>> to actually manipulate them (As described in blog-post). E.g.
>>>> Automatically move net-failover slaves by kernel to a different netns.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>> -- 
>>>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 21:43               ` Liran Alon
  2019-03-20 22:10                 ` Michael S. Tsirkin
@ 2019-03-20 22:10                 ` Michael S. Tsirkin
  2019-03-20 22:19                   ` Liran Alon
  2019-03-20 22:19                   ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-20 22:10 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>>>> Liran Alon <liran.alon@oracle.com> wrote:
> >>>>>> 
> >>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>>>> 
> >>>>>> Cloud-init should really just ignore all devices that have a master device.
> >>>>>> That would have been more general, and safer for other use cases.
> >>>>> 
> >>>>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>>>> safer to just somehow pretend to userspace that the slave links are
> >>>>> down? And add a special attribute for the actual link state.
> >>>> 
> >>>> I think this may be problematic as it would also break legit use case
> >>>> of userspace attempt to set various config on VF slave.
> >>>> In general, lying to userspace usually leads to problems.
> >>> 
> >>> I hear you on this. So how about instead of lying,
> >>> we basically just fail some accesses to slaves
> >>> unless a flag is set e.g. in ethtool.
> >>> 
> >>> Some userspace will need to change to set it but in a minor way.
> >>> Arguably/hopefully failure to set config would generally be a safer
> >>> failure.
> >> 
> >> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> > 
> > Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
> 
> Oh. I have indeed misunderstood your previous email then. :)
> Thanks for clarifying.
> 
> > 
> >> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> > 
> > I think sending/receiving should probably just fail unconditionally.
> 
> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
> unless skb is marked with some flag to indicate it has been sent via the net-failover master?

We can maybe avoid binding a protocol socket to the device?

> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
> 
> However, I see a couple of down-sides to it:
> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.


How about we fail to retrieve mac from the slave?

> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> 
> The set of changes I vision to fix our issues are:
> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> (E.g. Configure the net-failover VF slave in some special way).
> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> 
> Is there any clear issue we see regarding the above suggestion?
> 
> -Liran

The issue would be this: how do we avoid conflicting with namespaces
created by users?

> > 
> >> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> >> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> >> by userspace that it wishes to perform a set of actions on the net-failover slave.
> >> 
> >> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
> >> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> >> But of course maybe there are other ideas that can achieve similar behaviour.
> >> 
> >> -Liran
> >> 
> >>> 
> >>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> >>> More?
> >>> 
> >>>> If we reach
> >>>> to a scenario where we try to avoid userspace issues generically and
> >>>> not on a userspace component basis, I believe the right path should be
> >>>> to hide the net-failover slaves such that explicit action is required
> >>>> to actually manipulate them (As described in blog-post). E.g.
> >>>> Automatically move net-failover slaves by kernel to a different netns.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>> -- 
> >>>>> MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 21:43               ` Liran Alon
@ 2019-03-20 22:10                 ` Michael S. Tsirkin
  2019-03-20 22:10                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-20 22:10 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>>>> Liran Alon <liran.alon@oracle.com> wrote:
> >>>>>> 
> >>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>>>> 
> >>>>>> Cloud-init should really just ignore all devices that have a master device.
> >>>>>> That would have been more general, and safer for other use cases.
> >>>>> 
> >>>>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>>>> safer to just somehow pretend to userspace that the slave links are
> >>>>> down? And add a special attribute for the actual link state.
> >>>> 
> >>>> I think this may be problematic as it would also break legit use case
> >>>> of userspace attempt to set various config on VF slave.
> >>>> In general, lying to userspace usually leads to problems.
> >>> 
> >>> I hear you on this. So how about instead of lying,
> >>> we basically just fail some accesses to slaves
> >>> unless a flag is set e.g. in ethtool.
> >>> 
> >>> Some userspace will need to change to set it but in a minor way.
> >>> Arguably/hopefully failure to set config would generally be a safer
> >>> failure.
> >> 
> >> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> > 
> > Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
> 
> Oh. I have indeed misunderstood your previous email then. :)
> Thanks for clarifying.
> 
> > 
> >> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> > 
> > I think sending/receiving should probably just fail unconditionally.
> 
> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
> unless skb is marked with some flag to indicate it has been sent via the net-failover master?

We can maybe avoid binding a protocol socket to the device?

> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
> 
> However, I see a couple of down-sides to it:
> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.


How about we fail to retrieve mac from the slave?

> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> 
> The set of changes I vision to fix our issues are:
> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> (E.g. Configure the net-failover VF slave in some special way).
> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> 
> Is there any clear issue we see regarding the above suggestion?
> 
> -Liran

The issue would be this: how do we avoid conflicting with namespaces
created by users?

> > 
> >> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> >> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> >> by userspace that it wishes to perform a set of actions on the net-failover slave.
> >> 
> >> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
> >> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> >> But of course maybe there are other ideas that can achieve similar behaviour.
> >> 
> >> -Liran
> >> 
> >>> 
> >>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> >>> More?
> >>> 
> >>>> If we reach
> >>>> to a scenario where we try to avoid userspace issues generically and
> >>>> not on a userspace component basis, I believe the right path should be
> >>>> to hide the net-failover slaves such that explicit action is required
> >>>> to actually manipulate them (As described in blog-post). E.g.
> >>>> Automatically move net-failover slaves by kernel to a different netns.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>> -- 
> >>>>> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 22:10                 ` Michael S. Tsirkin
@ 2019-03-20 22:19                   ` Liran Alon
  2019-03-21  8:58                     ` Michael S. Tsirkin
  2019-03-21  8:58                     ` Michael S. Tsirkin
  2019-03-20 22:19                   ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-20 22:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>>>>>> 
>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>>>> 
>>>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>> 
>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>> down? And add a special attribute for the actual link state.
>>>>>> 
>>>>>> I think this may be problematic as it would also break legit use case
>>>>>> of userspace attempt to set various config on VF slave.
>>>>>> In general, lying to userspace usually leads to problems.
>>>>> 
>>>>> I hear you on this. So how about instead of lying,
>>>>> we basically just fail some accesses to slaves
>>>>> unless a flag is set e.g. in ethtool.
>>>>> 
>>>>> Some userspace will need to change to set it but in a minor way.
>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>> failure.
>>>> 
>>>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
>>> 
>>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>> 
>> Oh. I have indeed misunderstood your previous email then. :)
>> Thanks for clarifying.
>> 
>>> 
>>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
>>> 
>>> I think sending/receiving should probably just fail unconditionally.
>> 
>> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
>> unless skb is marked with some flag to indicate it has been sent via the net-failover master?
> 
> We can maybe avoid binding a protocol socket to the device?

That is indeed another possibility that would work to avoid the DHCP issues.
And will still allow checking connectivity. So it is better.
However, I still think it provides an non-intuitive customer experience.
In addition, I also want to take into account that most customers are expected a 1:1 mapping between a vNIC and a netdev.
i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it defined.
Customers usually don’t care how they get accelerated networking. They just care they do.

> 
>> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>> 
>> However, I see a couple of down-sides to it:
>> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
> 
> 
> How about we fail to retrieve mac from the slave?

That would work but I think it is cleaner to just not bind PV and VF based on having the same MAC.

> 
>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>> 
>> Is there any clear issue we see regarding the above suggestion?
>> 
>> -Liran
> 
> The issue would be this: how do we avoid conflicting with namespaces
> created by users?

This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
To reference a hidden netns, you need to do it explicitly. 
Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
Does this seems reasonable?

-Liran

> 
>>> 
>>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>>>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
>>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>>>> 
>>>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
>>>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>>>> But of course maybe there are other ideas that can achieve similar behaviour.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
>>>>> More?
>>>>> 
>>>>>> If we reach
>>>>>> to a scenario where we try to avoid userspace issues generically and
>>>>>> not on a userspace component basis, I believe the right path should be
>>>>>> to hide the net-failover slaves such that explicit action is required
>>>>>> to actually manipulate them (As described in blog-post). E.g.
>>>>>> Automatically move net-failover slaves by kernel to a different netns.
>>>>>> 
>>>>>> -Liran
>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 22:10                 ` Michael S. Tsirkin
  2019-03-20 22:19                   ` Liran Alon
@ 2019-03-20 22:19                   ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-20 22:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>>>>>> 
>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>>>> 
>>>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>> 
>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>> down? And add a special attribute for the actual link state.
>>>>>> 
>>>>>> I think this may be problematic as it would also break legit use case
>>>>>> of userspace attempt to set various config on VF slave.
>>>>>> In general, lying to userspace usually leads to problems.
>>>>> 
>>>>> I hear you on this. So how about instead of lying,
>>>>> we basically just fail some accesses to slaves
>>>>> unless a flag is set e.g. in ethtool.
>>>>> 
>>>>> Some userspace will need to change to set it but in a minor way.
>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>> failure.
>>>> 
>>>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
>>> 
>>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>> 
>> Oh. I have indeed misunderstood your previous email then. :)
>> Thanks for clarifying.
>> 
>>> 
>>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
>>> 
>>> I think sending/receiving should probably just fail unconditionally.
>> 
>> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
>> unless skb is marked with some flag to indicate it has been sent via the net-failover master?
> 
> We can maybe avoid binding a protocol socket to the device?

That is indeed another possibility that would work to avoid the DHCP issues.
And will still allow checking connectivity. So it is better.
However, I still think it provides an non-intuitive customer experience.
In addition, I also want to take into account that most customers are expected a 1:1 mapping between a vNIC and a netdev.
i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it defined.
Customers usually don’t care how they get accelerated networking. They just care they do.

> 
>> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>> 
>> However, I see a couple of down-sides to it:
>> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
> 
> 
> How about we fail to retrieve mac from the slave?

That would work but I think it is cleaner to just not bind PV and VF based on having the same MAC.

> 
>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>> 
>> Is there any clear issue we see regarding the above suggestion?
>> 
>> -Liran
> 
> The issue would be this: how do we avoid conflicting with namespaces
> created by users?

This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
To reference a hidden netns, you need to do it explicitly. 
Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
Does this seems reasonable?

-Liran

> 
>>> 
>>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>>>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
>>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>>>> 
>>>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
>>>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>>>> But of course maybe there are other ideas that can achieve similar behaviour.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
>>>>> More?
>>>>> 
>>>>>> If we reach
>>>>>> to a scenario where we try to avoid userspace issues generically and
>>>>>> not on a userspace component basis, I believe the right path should be
>>>>>> to hide the net-failover slaves such that explicit action is required
>>>>>> to actually manipulate them (As described in blog-post). E.g.
>>>>>> Automatically move net-failover slaves by kernel to a different netns.
>>>>>> 
>>>>>> -Liran
>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 22:19                   ` Liran Alon
  2019-03-21  8:58                     ` Michael S. Tsirkin
@ 2019-03-21  8:58                     ` Michael S. Tsirkin
  2019-03-21 10:07                       ` Liran Alon
  2019-03-21 10:07                       ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21  8:58 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
> >>>>>>>> 
> >>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>>>>>> 
> >>>>>>>> Cloud-init should really just ignore all devices that have a master device.
> >>>>>>>> That would have been more general, and safer for other use cases.
> >>>>>>> 
> >>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>>>>>> safer to just somehow pretend to userspace that the slave links are
> >>>>>>> down? And add a special attribute for the actual link state.
> >>>>>> 
> >>>>>> I think this may be problematic as it would also break legit use case
> >>>>>> of userspace attempt to set various config on VF slave.
> >>>>>> In general, lying to userspace usually leads to problems.
> >>>>> 
> >>>>> I hear you on this. So how about instead of lying,
> >>>>> we basically just fail some accesses to slaves
> >>>>> unless a flag is set e.g. in ethtool.
> >>>>> 
> >>>>> Some userspace will need to change to set it but in a minor way.
> >>>>> Arguably/hopefully failure to set config would generally be a safer
> >>>>> failure.
> >>>> 
> >>>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> >>> 
> >>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
> >> 
> >> Oh. I have indeed misunderstood your previous email then. :)
> >> Thanks for clarifying.
> >> 
> >>> 
> >>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> >>> 
> >>> I think sending/receiving should probably just fail unconditionally.
> >> 
> >> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
> >> unless skb is marked with some flag to indicate it has been sent via the net-failover master?
> > 
> > We can maybe avoid binding a protocol socket to the device?
> 
> That is indeed another possibility that would work to avoid the DHCP issues.
> And will still allow checking connectivity. So it is better.
> However, I still think it provides an non-intuitive customer experience.
> In addition, I also want to take into account that most customers are expected a 1:1 mapping between a vNIC and a netdev.
> i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it defined.
> Customers usually don’t care how they get accelerated networking. They just care they do.
> 
> > 
> >> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
> >> 
> >> However, I see a couple of down-sides to it:
> >> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
> >> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
> > 
> > 
> > How about we fail to retrieve mac from the slave?
> 
> That would work but I think it is cleaner to just not bind PV and VF based on having the same MAC.

There's a reference to that under "Non-MAC based pairing".

I'll look into making it more explicit.

> > 
> >> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >> 
> >> The set of changes I vision to fix our issues are:
> >> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >> (E.g. Configure the net-failover VF slave in some special way).
> >> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >> 
> >> Is there any clear issue we see regarding the above suggestion?
> >> 
> >> -Liran
> > 
> > The issue would be this: how do we avoid conflicting with namespaces
> > created by users?
> 
> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> To reference a hidden netns, you need to do it explicitly. 
> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).

Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

> Does this seems reasonable?
> 
> -Liran

Reasonable I'd say yes, easy to implement probably no. But maybe I
missed a trick or two.

> > 
> >>> 
> >>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> >>>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> >>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
> >>>> 
> >>>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
> >>>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> >>>> But of course maybe there are other ideas that can achieve similar behaviour.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> >>>>> More?
> >>>>> 
> >>>>>> If we reach
> >>>>>> to a scenario where we try to avoid userspace issues generically and
> >>>>>> not on a userspace component basis, I believe the right path should be
> >>>>>> to hide the net-failover slaves such that explicit action is required
> >>>>>> to actually manipulate them (As described in blog-post). E.g.
> >>>>>> Automatically move net-failover slaves by kernel to a different netns.
> >>>>>> 
> >>>>>> -Liran
> >>>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-20 22:19                   ` Liran Alon
@ 2019-03-21  8:58                     ` Michael S. Tsirkin
  2019-03-21  8:58                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21  8:58 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
> >>>>>>>> 
> >>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>>>>>> 
> >>>>>>>> Cloud-init should really just ignore all devices that have a master device.
> >>>>>>>> That would have been more general, and safer for other use cases.
> >>>>>>> 
> >>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>>>>>> safer to just somehow pretend to userspace that the slave links are
> >>>>>>> down? And add a special attribute for the actual link state.
> >>>>>> 
> >>>>>> I think this may be problematic as it would also break legit use case
> >>>>>> of userspace attempt to set various config on VF slave.
> >>>>>> In general, lying to userspace usually leads to problems.
> >>>>> 
> >>>>> I hear you on this. So how about instead of lying,
> >>>>> we basically just fail some accesses to slaves
> >>>>> unless a flag is set e.g. in ethtool.
> >>>>> 
> >>>>> Some userspace will need to change to set it but in a minor way.
> >>>>> Arguably/hopefully failure to set config would generally be a safer
> >>>>> failure.
> >>>> 
> >>>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> >>> 
> >>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
> >> 
> >> Oh. I have indeed misunderstood your previous email then. :)
> >> Thanks for clarifying.
> >> 
> >>> 
> >>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> >>> 
> >>> I think sending/receiving should probably just fail unconditionally.
> >> 
> >> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
> >> unless skb is marked with some flag to indicate it has been sent via the net-failover master?
> > 
> > We can maybe avoid binding a protocol socket to the device?
> 
> That is indeed another possibility that would work to avoid the DHCP issues.
> And will still allow checking connectivity. So it is better.
> However, I still think it provides an non-intuitive customer experience.
> In addition, I also want to take into account that most customers are expected a 1:1 mapping between a vNIC and a netdev.
> i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it defined.
> Customers usually don’t care how they get accelerated networking. They just care they do.
> 
> > 
> >> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
> >> 
> >> However, I see a couple of down-sides to it:
> >> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
> >> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
> > 
> > 
> > How about we fail to retrieve mac from the slave?
> 
> That would work but I think it is cleaner to just not bind PV and VF based on having the same MAC.

There's a reference to that under "Non-MAC based pairing".

I'll look into making it more explicit.

> > 
> >> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >> 
> >> The set of changes I vision to fix our issues are:
> >> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >> (E.g. Configure the net-failover VF slave in some special way).
> >> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >> 
> >> Is there any clear issue we see regarding the above suggestion?
> >> 
> >> -Liran
> > 
> > The issue would be this: how do we avoid conflicting with namespaces
> > created by users?
> 
> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> To reference a hidden netns, you need to do it explicitly. 
> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).

Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

> Does this seems reasonable?
> 
> -Liran

Reasonable I'd say yes, easy to implement probably no. But maybe I
missed a trick or two.

> > 
> >>> 
> >>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> >>>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> >>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
> >>>> 
> >>>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
> >>>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> >>>> But of course maybe there are other ideas that can achieve similar behaviour.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> >>>>> More?
> >>>>> 
> >>>>>> If we reach
> >>>>>> to a scenario where we try to avoid userspace issues generically and
> >>>>>> not on a userspace component basis, I believe the right path should be
> >>>>>> to hide the net-failover slaves such that explicit action is required
> >>>>>> to actually manipulate them (As described in blog-post). E.g.
> >>>>>> Automatically move net-failover slaves by kernel to a different netns.
> >>>>>> 
> >>>>>> -Liran
> >>>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21  8:58                     ` Michael S. Tsirkin
@ 2019-03-21 10:07                       ` Liran Alon
  2019-03-21 12:37                         ` Michael S. Tsirkin
  2019-03-21 12:37                         ` Michael S. Tsirkin
  2019-03-21 10:07                       ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 10:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 10:58, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>>>>>> 
>>>>>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>>>> 
>>>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>>>> down? And add a special attribute for the actual link state.
>>>>>>>> 
>>>>>>>> I think this may be problematic as it would also break legit use case
>>>>>>>> of userspace attempt to set various config on VF slave.
>>>>>>>> In general, lying to userspace usually leads to problems.
>>>>>>> 
>>>>>>> I hear you on this. So how about instead of lying,
>>>>>>> we basically just fail some accesses to slaves
>>>>>>> unless a flag is set e.g. in ethtool.
>>>>>>> 
>>>>>>> Some userspace will need to change to set it but in a minor way.
>>>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>>>> failure.
>>>>>> 
>>>>>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
>>>>> 
>>>>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>>>> 
>>>> Oh. I have indeed misunderstood your previous email then. :)
>>>> Thanks for clarifying.
>>>> 
>>>>> 
>>>>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
>>>>> 
>>>>> I think sending/receiving should probably just fail unconditionally.
>>>> 
>>>> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
>>>> unless skb is marked with some flag to indicate it has been sent via the net-failover master?
>>> 
>>> We can maybe avoid binding a protocol socket to the device?
>> 
>> That is indeed another possibility that would work to avoid the DHCP issues.
>> And will still allow checking connectivity. So it is better.
>> However, I still think it provides an non-intuitive customer experience.
>> In addition, I also want to take into account that most customers are expected a 1:1 mapping between a vNIC and a netdev.
>> i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it defined.
>> Customers usually don’t care how they get accelerated networking. They just care they do.
>> 
>>> 
>>>> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>>>> 
>>>> However, I see a couple of down-sides to it:
>>>> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
>>>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
>>> 
>>> 
>>> How about we fail to retrieve mac from the slave?
>> 
>> That would work but I think it is cleaner to just not bind PV and VF based on having the same MAC.
> 
> There's a reference to that under "Non-MAC based pairing".
> 
> I'll look into making it more explicit.

Yes I know. I was referring to what you described in that section.

> 
>>> 
>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>> 
>>>> The set of changes I vision to fix our issues are:
>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>> 
>>>> Is there any clear issue we see regarding the above suggestion?
>>>> 
>>>> -Liran
>>> 
>>> The issue would be this: how do we avoid conflicting with namespaces
>>> created by users?
>> 
>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>> To reference a hidden netns, you need to do it explicitly. 
>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> 
> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

This is also a good idea that will solve the issue. Yes.

> 
>> Does this seems reasonable?
>> 
>> -Liran
> 
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.

BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
the breakage we have today for the various userspace components.

-Liran

> 
>>> 
>>>>> 
>>>>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>>>>>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
>>>>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>>>>>> 
>>>>>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
>>>>>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>>>>>> But of course maybe there are other ideas that can achieve similar behaviour.
>>>>>> 
>>>>>> -Liran
>>>>>> 
>>>>>>> 
>>>>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
>>>>>>> More?
>>>>>>> 
>>>>>>>> If we reach
>>>>>>>> to a scenario where we try to avoid userspace issues generically and
>>>>>>>> not on a userspace component basis, I believe the right path should be
>>>>>>>> to hide the net-failover slaves such that explicit action is required
>>>>>>>> to actually manipulate them (As described in blog-post). E.g.
>>>>>>>> Automatically move net-failover slaves by kernel to a different netns.
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21  8:58                     ` Michael S. Tsirkin
  2019-03-21 10:07                       ` Liran Alon
@ 2019-03-21 10:07                       ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 10:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 10:58, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
>>>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>>>>>> 
>>>>>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>>>> 
>>>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>>>> down? And add a special attribute for the actual link state.
>>>>>>>> 
>>>>>>>> I think this may be problematic as it would also break legit use case
>>>>>>>> of userspace attempt to set various config on VF slave.
>>>>>>>> In general, lying to userspace usually leads to problems.
>>>>>>> 
>>>>>>> I hear you on this. So how about instead of lying,
>>>>>>> we basically just fail some accesses to slaves
>>>>>>> unless a flag is set e.g. in ethtool.
>>>>>>> 
>>>>>>> Some userspace will need to change to set it but in a minor way.
>>>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>>>> failure.
>>>>>> 
>>>>>> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
>>>>> 
>>>>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>>>> 
>>>> Oh. I have indeed misunderstood your previous email then. :)
>>>> Thanks for clarifying.
>>>> 
>>>>> 
>>>>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
>>>>> 
>>>>> I think sending/receiving should probably just fail unconditionally.
>>>> 
>>>> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
>>>> unless skb is marked with some flag to indicate it has been sent via the net-failover master?
>>> 
>>> We can maybe avoid binding a protocol socket to the device?
>> 
>> That is indeed another possibility that would work to avoid the DHCP issues.
>> And will still allow checking connectivity. So it is better.
>> However, I still think it provides an non-intuitive customer experience.
>> In addition, I also want to take into account that most customers are expected a 1:1 mapping between a vNIC and a netdev.
>> i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it defined.
>> Customers usually don’t care how they get accelerated networking. They just care they do.
>> 
>>> 
>>>> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>>>> 
>>>> However, I see a couple of down-sides to it:
>>>> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
>>>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.
>>> 
>>> 
>>> How about we fail to retrieve mac from the slave?
>> 
>> That would work but I think it is cleaner to just not bind PV and VF based on having the same MAC.
> 
> There's a reference to that under "Non-MAC based pairing".
> 
> I'll look into making it more explicit.

Yes I know. I was referring to what you described in that section.

> 
>>> 
>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>> 
>>>> The set of changes I vision to fix our issues are:
>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>> 
>>>> Is there any clear issue we see regarding the above suggestion?
>>>> 
>>>> -Liran
>>> 
>>> The issue would be this: how do we avoid conflicting with namespaces
>>> created by users?
>> 
>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>> To reference a hidden netns, you need to do it explicitly. 
>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> 
> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

This is also a good idea that will solve the issue. Yes.

> 
>> Does this seems reasonable?
>> 
>> -Liran
> 
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.

BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
the breakage we have today for the various userspace components.

-Liran

> 
>>> 
>>>>> 
>>>>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>>>>>> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
>>>>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>>>>>> 
>>>>>> Something that was achieved if, for example, the net-failover slaves were in a different netns than default netns.
>>>>>> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>>>>>> But of course maybe there are other ideas that can achieve similar behaviour.
>>>>>> 
>>>>>> -Liran
>>>>>> 
>>>>>>> 
>>>>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
>>>>>>> More?
>>>>>>> 
>>>>>>>> If we reach
>>>>>>>> to a scenario where we try to avoid userspace issues generically and
>>>>>>>> not on a userspace component basis, I believe the right path should be
>>>>>>>> to hide the net-failover slaves such that explicit action is required
>>>>>>>> to actually manipulate them (As described in blog-post). E.g.
>>>>>>>> Automatically move net-failover slaves by kernel to a different netns.
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 10:07                       ` Liran Alon
  2019-03-21 12:37                         ` Michael S. Tsirkin
@ 2019-03-21 12:37                         ` Michael S. Tsirkin
  2019-03-21 12:47                           ` Liran Alon
  2019-03-21 12:47                           ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 12:37 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>> 
> >>>> The set of changes I vision to fix our issues are:
> >>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>> 
> >>>> Is there any clear issue we see regarding the above suggestion?
> >>>> 
> >>>> -Liran
> >>> 
> >>> The issue would be this: how do we avoid conflicting with namespaces
> >>> created by users?
> >> 
> >> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >> To reference a hidden netns, you need to do it explicitly. 
> >> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> > 
> > Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> 
> This is also a good idea that will solve the issue. Yes.
> 
> > 
> >> Does this seems reasonable?
> >> 
> >> -Liran
> > 
> > Reasonable I'd say yes, easy to implement probably no. But maybe I
> > missed a trick or two.
> 
> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> the breakage we have today for the various userspace components.
> 
> -Liran

It seems quite easy to supply that as a module parameter. Do we need two
namespaces though? Won't some userspace still be confused by the two
slaves sharing the MAC address?

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 10:07                       ` Liran Alon
@ 2019-03-21 12:37                         ` Michael S. Tsirkin
  2019-03-21 12:37                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 12:37 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>> 
> >>>> The set of changes I vision to fix our issues are:
> >>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>> 
> >>>> Is there any clear issue we see regarding the above suggestion?
> >>>> 
> >>>> -Liran
> >>> 
> >>> The issue would be this: how do we avoid conflicting with namespaces
> >>> created by users?
> >> 
> >> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >> To reference a hidden netns, you need to do it explicitly. 
> >> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> > 
> > Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> 
> This is also a good idea that will solve the issue. Yes.
> 
> > 
> >> Does this seems reasonable?
> >> 
> >> -Liran
> > 
> > Reasonable I'd say yes, easy to implement probably no. But maybe I
> > missed a trick or two.
> 
> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> the breakage we have today for the various userspace components.
> 
> -Liran

It seems quite easy to supply that as a module parameter. Do we need two
namespaces though? Won't some userspace still be confused by the two
slaves sharing the MAC address?

-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:37                         ` Michael S. Tsirkin
@ 2019-03-21 12:47                           ` Liran Alon
  2019-03-21 12:57                             ` Michael S. Tsirkin
  2019-03-21 12:57                             ` Michael S. Tsirkin
  2019-03-21 12:47                           ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 12:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>> 
>>>>>> The set of changes I vision to fix our issues are:
>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>> 
>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>> created by users?
>>>> 
>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>> To reference a hidden netns, you need to do it explicitly. 
>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
>>>> Does this seems reasonable?
>>>> 
>>>> -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?

That’s one reasonable option.
Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.

The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.

-Liran

> 
> -- 
> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:37                         ` Michael S. Tsirkin
  2019-03-21 12:47                           ` Liran Alon
@ 2019-03-21 12:47                           ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 12:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>> 
>>>>>> The set of changes I vision to fix our issues are:
>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>> 
>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>> created by users?
>>>> 
>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>> To reference a hidden netns, you need to do it explicitly. 
>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
>>>> Does this seems reasonable?
>>>> 
>>>> -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?

That’s one reasonable option.
Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.

The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.

-Liran

> 
> -- 
> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:47                           ` Liran Alon
  2019-03-21 12:57                             ` Michael S. Tsirkin
@ 2019-03-21 12:57                             ` Michael S. Tsirkin
  2019-03-21 13:04                               ` Liran Alon
                                                 ` (3 more replies)
  1 sibling, 4 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 12:57 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>> 
> >>>>>> The set of changes I vision to fix our issues are:
> >>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>> 
> >>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>> created by users?
> >>>> 
> >>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>> To reference a hidden netns, you need to do it explicitly. 
> >>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>> 
> >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >> 
> >> This is also a good idea that will solve the issue. Yes.
> >> 
> >>> 
> >>>> Does this seems reasonable?
> >>>> 
> >>>> -Liran
> >>> 
> >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>> missed a trick or two.
> >> 
> >> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >> the breakage we have today for the various userspace components.
> >> 
> >> -Liran
> > 
> > It seems quite easy to supply that as a module parameter. Do we need two
> > namespaces though? Won't some userspace still be confused by the two
> > slaves sharing the MAC address?
> 
> That’s one reasonable option.
> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> 
> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> 
> -Liran

OK. Now what happens if master is moved to another namespace? Do we need
to move the slaves too?

Also siwei's patch is then kind of extraneous right?
Attempts to rename a slave will now fail as it's in a namespace...

> > 
> > -- 
> > MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:47                           ` Liran Alon
@ 2019-03-21 12:57                             ` Michael S. Tsirkin
  2019-03-21 12:57                             ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 12:57 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>> 
> >>>>>> The set of changes I vision to fix our issues are:
> >>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>> 
> >>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>> created by users?
> >>>> 
> >>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>> To reference a hidden netns, you need to do it explicitly. 
> >>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>> 
> >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >> 
> >> This is also a good idea that will solve the issue. Yes.
> >> 
> >>> 
> >>>> Does this seems reasonable?
> >>>> 
> >>>> -Liran
> >>> 
> >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>> missed a trick or two.
> >> 
> >> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >> the breakage we have today for the various userspace components.
> >> 
> >> -Liran
> > 
> > It seems quite easy to supply that as a module parameter. Do we need two
> > namespaces though? Won't some userspace still be confused by the two
> > slaves sharing the MAC address?
> 
> That’s one reasonable option.
> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> 
> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> 
> -Liran

OK. Now what happens if master is moved to another namespace? Do we need
to move the slaves too?

Also siwei's patch is then kind of extraneous right?
Attempts to rename a slave will now fail as it's in a namespace...

> > 
> > -- 
> > MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:57                             ` Michael S. Tsirkin
  2019-03-21 13:04                               ` Liran Alon
@ 2019-03-21 13:04                               ` Liran Alon
  2019-03-21 13:12                                 ` Michael S. Tsirkin
                                                   ` (3 more replies)
  2019-03-21 15:44                               ` Stephen Hemminger
  2019-03-21 15:44                               ` Stephen Hemminger
  3 siblings, 4 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 13:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>> 
>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>> 
>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>> created by users?
>>>>>> 
>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>> 
>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>> 
>>>> This is also a good idea that will solve the issue. Yes.
>>>> 
>>>>> 
>>>>>> Does this seems reasonable?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>> missed a trick or two.
>>>> 
>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>> the breakage we have today for the various userspace components.
>>>> 
>>>> -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?

No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.

> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace…

I’m not sure actually. Isn't udev/systemd netns-aware?
I would expect it to be able to provide names also to netdevs in netns different than default netns.
If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.

-Liran

> 
>>> 
>>> -- 
>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:57                             ` Michael S. Tsirkin
@ 2019-03-21 13:04                               ` Liran Alon
  2019-03-21 13:04                               ` Liran Alon
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 13:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>> 
>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>> 
>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>> created by users?
>>>>>> 
>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>> 
>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>> 
>>>> This is also a good idea that will solve the issue. Yes.
>>>> 
>>>>> 
>>>>>> Does this seems reasonable?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>> missed a trick or two.
>>>> 
>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>> the breakage we have today for the various userspace components.
>>>> 
>>>> -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?

No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.

> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace…

I’m not sure actually. Isn't udev/systemd netns-aware?
I would expect it to be able to provide names also to netdevs in netns different than default netns.
If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.

-Liran

> 
>>> 
>>> -- 
>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:04                               ` Liran Alon
  2019-03-21 13:12                                 ` Michael S. Tsirkin
@ 2019-03-21 13:12                                 ` Michael S. Tsirkin
  2019-03-21 13:24                                   ` Liran Alon
  2019-03-21 13:24                                   ` Liran Alon
  2019-03-21 15:45                                 ` Stephen Hemminger
  2019-03-21 15:45                                 ` Stephen Hemminger
  3 siblings, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 13:12 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>> 
> >>>>>>>> The set of changes I vision to fix our issues are:
> >>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>>>> 
> >>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>> created by users?
> >>>>>> 
> >>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>> To reference a hidden netns, you need to do it explicitly. 
> >>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>>>> 
> >>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>> 
> >>>> This is also a good idea that will solve the issue. Yes.
> >>>> 
> >>>>> 
> >>>>>> Does this seems reasonable?
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>> missed a trick or two.
> >>>> 
> >>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >>>> the breakage we have today for the various userspace components.
> >>>> 
> >>>> -Liran
> >>> 
> >>> It seems quite easy to supply that as a module parameter. Do we need two
> >>> namespaces though? Won't some userspace still be confused by the two
> >>> slaves sharing the MAC address?
> >> 
> >> That’s one reasonable option.
> >> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> >> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> >> 
> >> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> >> 
> >> -Liran
> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?
> 
> No. Why would we move the slaves?


The reason we have 3 device model at all is so users can fine tune the
slaves. I don't see why this applies to the root namespace but not
a container. If it has access to failover it should have access
to slaves.

> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.

So that makes the common case easy. That is good. My worry is it might
make some uncommon cases impossible.

> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> 
> > 
> > Also siwei's patch is then kind of extraneous right?
> > Attempts to rename a slave will now fail as it's in a namespace…
> 
> I’m not sure actually. Isn't udev/systemd netns-aware?
> I would expect it to be able to provide names also to netdevs in netns different than default netns.

I think most people move devices after they are renamed.

> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
> 
> -Liran
> 
> > 
> >>> 
> >>> -- 
> >>> MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:04                               ` Liran Alon
@ 2019-03-21 13:12                                 ` Michael S. Tsirkin
  2019-03-21 13:12                                 ` Michael S. Tsirkin
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 13:12 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>> 
> >>>>>>>> The set of changes I vision to fix our issues are:
> >>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>>>> 
> >>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>> created by users?
> >>>>>> 
> >>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>> To reference a hidden netns, you need to do it explicitly. 
> >>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>>>> 
> >>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>> 
> >>>> This is also a good idea that will solve the issue. Yes.
> >>>> 
> >>>>> 
> >>>>>> Does this seems reasonable?
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>> missed a trick or two.
> >>>> 
> >>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >>>> the breakage we have today for the various userspace components.
> >>>> 
> >>>> -Liran
> >>> 
> >>> It seems quite easy to supply that as a module parameter. Do we need two
> >>> namespaces though? Won't some userspace still be confused by the two
> >>> slaves sharing the MAC address?
> >> 
> >> That’s one reasonable option.
> >> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> >> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> >> 
> >> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> >> 
> >> -Liran
> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?
> 
> No. Why would we move the slaves?


The reason we have 3 device model at all is so users can fine tune the
slaves. I don't see why this applies to the root namespace but not
a container. If it has access to failover it should have access
to slaves.

> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.

So that makes the common case easy. That is good. My worry is it might
make some uncommon cases impossible.

> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> 
> > 
> > Also siwei's patch is then kind of extraneous right?
> > Attempts to rename a slave will now fail as it's in a namespace…
> 
> I’m not sure actually. Isn't udev/systemd netns-aware?
> I would expect it to be able to provide names also to netdevs in netns different than default netns.

I think most people move devices after they are renamed.

> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
> 
> -Liran
> 
> > 
> >>> 
> >>> -- 
> >>> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:12                                 ` Michael S. Tsirkin
  2019-03-21 13:24                                   ` Liran Alon
@ 2019-03-21 13:24                                   ` Liran Alon
  2019-03-21 13:51                                     ` Michael S. Tsirkin
  2019-03-21 13:51                                     ` Michael S. Tsirkin
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 13:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>>>> 
>>>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>>>> 
>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>> created by users?
>>>>>>>> 
>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>>>> 
>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>>> 
>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>> 
>>>>>>> 
>>>>>>>> Does this seems reasonable?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>> missed a trick or two.
>>>>>> 
>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>>>> the breakage we have today for the various userspace components.
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>>> namespaces though? Won't some userspace still be confused by the two
>>>>> slaves sharing the MAC address?
>>>> 
>>>> That’s one reasonable option.
>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>>>> 
>>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>>>> 
>>>> -Liran
>>> 
>>> OK. Now what happens if master is moved to another namespace? Do we need
>>> to move the slaves too?
>> 
>> No. Why would we move the slaves?
> 
> 
> The reason we have 3 device model at all is so users can fine tune the
> slaves.

I Agree.

> I don't see why this applies to the root namespace but not
> a container. If it has access to failover it should have access
> to slaves.

Oh now I see your point. I haven’t thought about the containers usage.
My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.

Do you have a suggestion how to handle this?

One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
have this updated iproute2.

> 
>> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> 
> So that makes the common case easy. That is good. My worry is it might
> make some uncommon cases impossible.
> 
>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>> 
>>> 
>>> Also siwei's patch is then kind of extraneous right?
>>> Attempts to rename a slave will now fail as it's in a namespace…
>> 
>> I’m not sure actually. Isn't udev/systemd netns-aware?
>> I would expect it to be able to provide names also to netdevs in netns different than default netns.
> 
> I think most people move devices after they are renamed.

So?
Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
This should happen (to my understanding) regardless of network namespaces.

-Liran

> 
>> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
>> 
>> -Liran
>> 
>>> 
>>>>> 
>>>>> -- 
>>>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:12                                 ` Michael S. Tsirkin
@ 2019-03-21 13:24                                   ` Liran Alon
  2019-03-21 13:24                                   ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 13:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>>>> 
>>>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>>>> 
>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>> created by users?
>>>>>>>> 
>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>>>> 
>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>>> 
>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>> 
>>>>>>> 
>>>>>>>> Does this seems reasonable?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>> missed a trick or two.
>>>>>> 
>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>>>> the breakage we have today for the various userspace components.
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>>> namespaces though? Won't some userspace still be confused by the two
>>>>> slaves sharing the MAC address?
>>>> 
>>>> That’s one reasonable option.
>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>>>> 
>>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>>>> 
>>>> -Liran
>>> 
>>> OK. Now what happens if master is moved to another namespace? Do we need
>>> to move the slaves too?
>> 
>> No. Why would we move the slaves?
> 
> 
> The reason we have 3 device model at all is so users can fine tune the
> slaves.

I Agree.

> I don't see why this applies to the root namespace but not
> a container. If it has access to failover it should have access
> to slaves.

Oh now I see your point. I haven’t thought about the containers usage.
My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.

Do you have a suggestion how to handle this?

One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
have this updated iproute2.

> 
>> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> 
> So that makes the common case easy. That is good. My worry is it might
> make some uncommon cases impossible.
> 
>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>> 
>>> 
>>> Also siwei's patch is then kind of extraneous right?
>>> Attempts to rename a slave will now fail as it's in a namespace…
>> 
>> I’m not sure actually. Isn't udev/systemd netns-aware?
>> I would expect it to be able to provide names also to netdevs in netns different than default netns.
> 
> I think most people move devices after they are renamed.

So?
Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
This should happen (to my understanding) regardless of network namespaces.

-Liran

> 
>> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
>> 
>> -Liran
>> 
>>> 
>>>>> 
>>>>> -- 
>>>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:24                                   ` Liran Alon
  2019-03-21 13:51                                     ` Michael S. Tsirkin
@ 2019-03-21 13:51                                     ` Michael S. Tsirkin
  2019-03-21 14:16                                       ` Liran Alon
  2019-03-21 14:16                                       ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 13:51 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>>>> 
> >>>>>>>>>> The set of changes I vision to fix our issues are:
> >>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>>>>>> 
> >>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>>>> 
> >>>>>>>>>> -Liran
> >>>>>>>>> 
> >>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>>>> created by users?
> >>>>>>>> 
> >>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>>>> To reference a hidden netns, you need to do it explicitly. 
> >>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>>>>>> 
> >>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>>>> 
> >>>>>> This is also a good idea that will solve the issue. Yes.
> >>>>>> 
> >>>>>>> 
> >>>>>>>> Does this seems reasonable?
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>>>> missed a trick or two.
> >>>>>> 
> >>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >>>>>> the breakage we have today for the various userspace components.
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> It seems quite easy to supply that as a module parameter. Do we need two
> >>>>> namespaces though? Won't some userspace still be confused by the two
> >>>>> slaves sharing the MAC address?
> >>>> 
> >>>> That’s one reasonable option.
> >>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> >>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> >>>> 
> >>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> >>>> 
> >>>> -Liran
> >>> 
> >>> OK. Now what happens if master is moved to another namespace? Do we need
> >>> to move the slaves too?
> >> 
> >> No. Why would we move the slaves?
> > 
> > 
> > The reason we have 3 device model at all is so users can fine tune the
> > slaves.
> 
> I Agree.
> 
> > I don't see why this applies to the root namespace but not
> > a container. If it has access to failover it should have access
> > to slaves.
> 
> Oh now I see your point. I haven’t thought about the containers usage.
> My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.
> 
> Do you have a suggestion how to handle this?
> 
> One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
> If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
> that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
> have this updated iproute2.

Right so slaves need to be moved whenever master is moved.

Given the amount of mess involved, should we just teach
userspace to create the hidden netns and move slaves there?

> > 
> >> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> > 
> > So that makes the common case easy. That is good. My worry is it might
> > make some uncommon cases impossible.
> > 
> >> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >> 
> >>> 
> >>> Also siwei's patch is then kind of extraneous right?
> >>> Attempts to rename a slave will now fail as it's in a namespace…
> >> 
> >> I’m not sure actually. Isn't udev/systemd netns-aware?
> >> I would expect it to be able to provide names also to netdevs in netns different than default netns.
> > 
> > I think most people move devices after they are renamed.
> 
> So?
> Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
> This should happen (to my understanding) regardless of network namespaces.
> 
> -Liran

My point was that any tool that moves devices after they
are renamed will be broken by kernel automatically putting
them in a namespace.

> > 
> >> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
> >> 
> >> -Liran
> >> 
> >>> 
> >>>>> 
> >>>>> -- 
> >>>>> MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:24                                   ` Liran Alon
@ 2019-03-21 13:51                                     ` Michael S. Tsirkin
  2019-03-21 13:51                                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 13:51 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>>>> 
> >>>>>>>>>> The set of changes I vision to fix our issues are:
> >>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>>>>>> 
> >>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>>>> 
> >>>>>>>>>> -Liran
> >>>>>>>>> 
> >>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>>>> created by users?
> >>>>>>>> 
> >>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>>>> To reference a hidden netns, you need to do it explicitly. 
> >>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>>>>>> 
> >>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>>>> 
> >>>>>> This is also a good idea that will solve the issue. Yes.
> >>>>>> 
> >>>>>>> 
> >>>>>>>> Does this seems reasonable?
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>>>> missed a trick or two.
> >>>>>> 
> >>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >>>>>> the breakage we have today for the various userspace components.
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> It seems quite easy to supply that as a module parameter. Do we need two
> >>>>> namespaces though? Won't some userspace still be confused by the two
> >>>>> slaves sharing the MAC address?
> >>>> 
> >>>> That’s one reasonable option.
> >>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> >>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> >>>> 
> >>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> >>>> 
> >>>> -Liran
> >>> 
> >>> OK. Now what happens if master is moved to another namespace? Do we need
> >>> to move the slaves too?
> >> 
> >> No. Why would we move the slaves?
> > 
> > 
> > The reason we have 3 device model at all is so users can fine tune the
> > slaves.
> 
> I Agree.
> 
> > I don't see why this applies to the root namespace but not
> > a container. If it has access to failover it should have access
> > to slaves.
> 
> Oh now I see your point. I haven’t thought about the containers usage.
> My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.
> 
> Do you have a suggestion how to handle this?
> 
> One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
> If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
> that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
> have this updated iproute2.

Right so slaves need to be moved whenever master is moved.

Given the amount of mess involved, should we just teach
userspace to create the hidden netns and move slaves there?

> > 
> >> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> > 
> > So that makes the common case easy. That is good. My worry is it might
> > make some uncommon cases impossible.
> > 
> >> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >> 
> >>> 
> >>> Also siwei's patch is then kind of extraneous right?
> >>> Attempts to rename a slave will now fail as it's in a namespace…
> >> 
> >> I’m not sure actually. Isn't udev/systemd netns-aware?
> >> I would expect it to be able to provide names also to netdevs in netns different than default netns.
> > 
> > I think most people move devices after they are renamed.
> 
> So?
> Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
> This should happen (to my understanding) regardless of network namespaces.
> 
> -Liran

My point was that any tool that moves devices after they
are renamed will be broken by kernel automatically putting
them in a namespace.

> > 
> >> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
> >> 
> >> -Liran
> >> 
> >>> 
> >>>>> 
> >>>>> -- 
> >>>>> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:51                                     ` Michael S. Tsirkin
@ 2019-03-21 14:16                                       ` Liran Alon
  2019-03-21 15:15                                         ` Michael S. Tsirkin
  2019-03-21 15:15                                         ` Michael S. Tsirkin
  2019-03-21 14:16                                       ` Liran Alon
  1 sibling, 2 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 14:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 15:51, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>>>>>> 
>>>>>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>>>> 
>>>>>>>>>>>> -Liran
>>>>>>>>>>> 
>>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>>>> created by users?
>>>>>>>>>> 
>>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>>>>>> 
>>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>>>>> 
>>>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Does this seems reasonable?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>>>> missed a trick or two.
>>>>>>>> 
>>>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>>>>>> the breakage we have today for the various userspace components.
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>>>>> namespaces though? Won't some userspace still be confused by the two
>>>>>>> slaves sharing the MAC address?
>>>>>> 
>>>>>> That’s one reasonable option.
>>>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>>>>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>>>>>> 
>>>>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>> to move the slaves too?
>>>> 
>>>> No. Why would we move the slaves?
>>> 
>>> 
>>> The reason we have 3 device model at all is so users can fine tune the
>>> slaves.
>> 
>> I Agree.
>> 
>>> I don't see why this applies to the root namespace but not
>>> a container. If it has access to failover it should have access
>>> to slaves.
>> 
>> Oh now I see your point. I haven’t thought about the containers usage.
>> My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.
>> 
>> Do you have a suggestion how to handle this?
>> 
>> One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
>> If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
>> that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
>> have this updated iproute2.
> 
> Right so slaves need to be moved whenever master is moved.
> 
> Given the amount of mess involved, should we just teach
> userspace to create the hidden netns and move slaves there?

That’s a good question.

However, I believe that it is easier and more suitable to happen in kernel. This is because:
1) Implementation is generic across all various distros.
2) We seem to discover more and more issues with userspace as we keep testing this on various distros, configurations and workloads.
3) It seems weird that kernel does some things automagically and some things don’t. i.e. Kernel automatically binds the virtio-net and VF to net-failover master
and automatically opens the net-failover slave when the net-failover master is opened, but it doesn’t care about the consequences these actions have on userspace.
Therefore, I propose let’s go “all in”: Kernel should also be responsible for hiding it’s artefacts unless customer userspace explicitly wants to view and manipulate them.

> 
>>> 
>>>> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
>>> 
>>> So that makes the common case easy. That is good. My worry is it might
>>> make some uncommon cases impossible.
>>> 
>>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>>>> 
>>>>> 
>>>>> Also siwei's patch is then kind of extraneous right?
>>>>> Attempts to rename a slave will now fail as it's in a namespace…
>>>> 
>>>> I’m not sure actually. Isn't udev/systemd netns-aware?
>>>> I would expect it to be able to provide names also to netdevs in netns different than default netns.
>>> 
>>> I think most people move devices after they are renamed.
>> 
>> So?
>> Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
>> This should happen (to my understanding) regardless of network namespaces.
>> 
>> -Liran
> 
> My point was that any tool that moves devices after they
> are renamed will be broken by kernel automatically putting
> them in a namespace.

I’m not sure I follow. How is this related to Si-Wei patch?
Si-Wei patch (and the root-cause that leads to the issue it fixes) have nothing to do with network namespaces.

What do you mean tool that moves devices after they are renamed will be broken by kernel?
Care to give an example to clarify?

-Liran

> 
>>> 
>>>> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:51                                     ` Michael S. Tsirkin
  2019-03-21 14:16                                       ` Liran Alon
@ 2019-03-21 14:16                                       ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 14:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 15:51, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>>>>>> 
>>>>>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>>>> 
>>>>>>>>>>>> -Liran
>>>>>>>>>>> 
>>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>>>> created by users?
>>>>>>>>>> 
>>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>>>>>> 
>>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>>>>> 
>>>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Does this seems reasonable?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>>>> missed a trick or two.
>>>>>>>> 
>>>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>>>>>> the breakage we have today for the various userspace components.
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>>>>> namespaces though? Won't some userspace still be confused by the two
>>>>>>> slaves sharing the MAC address?
>>>>>> 
>>>>>> That’s one reasonable option.
>>>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>>>>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>>>>>> 
>>>>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>> to move the slaves too?
>>>> 
>>>> No. Why would we move the slaves?
>>> 
>>> 
>>> The reason we have 3 device model at all is so users can fine tune the
>>> slaves.
>> 
>> I Agree.
>> 
>>> I don't see why this applies to the root namespace but not
>>> a container. If it has access to failover it should have access
>>> to slaves.
>> 
>> Oh now I see your point. I haven’t thought about the containers usage.
>> My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.
>> 
>> Do you have a suggestion how to handle this?
>> 
>> One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
>> If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
>> that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
>> have this updated iproute2.
> 
> Right so slaves need to be moved whenever master is moved.
> 
> Given the amount of mess involved, should we just teach
> userspace to create the hidden netns and move slaves there?

That’s a good question.

However, I believe that it is easier and more suitable to happen in kernel. This is because:
1) Implementation is generic across all various distros.
2) We seem to discover more and more issues with userspace as we keep testing this on various distros, configurations and workloads.
3) It seems weird that kernel does some things automagically and some things don’t. i.e. Kernel automatically binds the virtio-net and VF to net-failover master
and automatically opens the net-failover slave when the net-failover master is opened, but it doesn’t care about the consequences these actions have on userspace.
Therefore, I propose let’s go “all in”: Kernel should also be responsible for hiding it’s artefacts unless customer userspace explicitly wants to view and manipulate them.

> 
>>> 
>>>> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
>>> 
>>> So that makes the common case easy. That is good. My worry is it might
>>> make some uncommon cases impossible.
>>> 
>>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>>>> 
>>>>> 
>>>>> Also siwei's patch is then kind of extraneous right?
>>>>> Attempts to rename a slave will now fail as it's in a namespace…
>>>> 
>>>> I’m not sure actually. Isn't udev/systemd netns-aware?
>>>> I would expect it to be able to provide names also to netdevs in netns different than default netns.
>>> 
>>> I think most people move devices after they are renamed.
>> 
>> So?
>> Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
>> This should happen (to my understanding) regardless of network namespaces.
>> 
>> -Liran
> 
> My point was that any tool that moves devices after they
> are renamed will be broken by kernel automatically putting
> them in a namespace.

I’m not sure I follow. How is this related to Si-Wei patch?
Si-Wei patch (and the root-cause that leads to the issue it fixes) have nothing to do with network namespaces.

What do you mean tool that moves devices after they are renamed will be broken by kernel?
Care to give an example to clarify?

-Liran

> 
>>> 
>>>> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 14:16                                       ` Liran Alon
@ 2019-03-21 15:15                                         ` Michael S. Tsirkin
  2019-03-21 15:15                                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 15:15 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 04:16:14PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:51, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The set of changes I vision to fix our issues are:
> >>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -Liran
> >>>>>>>>>>> 
> >>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>>>>>> created by users?
> >>>>>>>>>> 
> >>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
> >>>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>>>>>>>> 
> >>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>>>>>> 
> >>>>>>>> This is also a good idea that will solve the issue. Yes.
> >>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> Does this seems reasonable?
> >>>>>>>>>> 
> >>>>>>>>>> -Liran
> >>>>>>>>> 
> >>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>>>>>> missed a trick or two.
> >>>>>>>> 
> >>>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >>>>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >>>>>>>> the breakage we have today for the various userspace components.
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> It seems quite easy to supply that as a module parameter. Do we need two
> >>>>>>> namespaces though? Won't some userspace still be confused by the two
> >>>>>>> slaves sharing the MAC address?
> >>>>>> 
> >>>>>> That’s one reasonable option.
> >>>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> >>>>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> >>>>>> 
> >>>>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> OK. Now what happens if master is moved to another namespace? Do we need
> >>>>> to move the slaves too?
> >>>> 
> >>>> No. Why would we move the slaves?
> >>> 
> >>> 
> >>> The reason we have 3 device model at all is so users can fine tune the
> >>> slaves.
> >> 
> >> I Agree.
> >> 
> >>> I don't see why this applies to the root namespace but not
> >>> a container. If it has access to failover it should have access
> >>> to slaves.
> >> 
> >> Oh now I see your point. I haven’t thought about the containers usage.
> >> My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.
> >> 
> >> Do you have a suggestion how to handle this?
> >> 
> >> One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
> >> If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
> >> that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
> >> have this updated iproute2.
> > 
> > Right so slaves need to be moved whenever master is moved.
> > 
> > Given the amount of mess involved, should we just teach
> > userspace to create the hidden netns and move slaves there?
> 
> That’s a good question.
> 
> However, I believe that it is easier and more suitable to happen in kernel. This is because:
> 1) Implementation is generic across all various distros.
> 2) We seem to discover more and more issues with userspace as we keep testing this on various distros, configurations and workloads.
> 3) It seems weird that kernel does some things automagically and some things don’t. i.e. Kernel automatically binds the virtio-net and VF to net-failover master
> and automatically opens the net-failover slave when the net-failover master is opened, but it doesn’t care about the consequences these actions have on userspace.
> Therefore, I propose let’s go “all in”: Kernel should also be responsible for hiding it’s artefacts unless customer userspace explicitly wants to view and manipulate them.

Just a minor point: failover device is an artefact of kernel. Standy and
primary devices are created by the hypervisor.

> > 
> >>> 
> >>>> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> >>> 
> >>> So that makes the common case easy. That is good. My worry is it might
> >>> make some uncommon cases impossible.
> >>> 
> >>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >>>> 
> >>>>> 
> >>>>> Also siwei's patch is then kind of extraneous right?
> >>>>> Attempts to rename a slave will now fail as it's in a namespace…
> >>>> 
> >>>> I’m not sure actually. Isn't udev/systemd netns-aware?
> >>>> I would expect it to be able to provide names also to netdevs in netns different than default netns.
> >>> 
> >>> I think most people move devices after they are renamed.
> >> 
> >> So?
> >> Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
> >> This should happen (to my understanding) regardless of network namespaces.
> >> 
> >> -Liran
> > 
> > My point was that any tool that moves devices after they
> > are renamed will be broken by kernel automatically putting
> > them in a namespace.
> 
> I’m not sure I follow. How is this related to Si-Wei patch?
> Si-Wei patch (and the root-cause that leads to the issue it fixes) have nothing to do with network namespaces.
> 
> What do you mean tool that moves devices after they are renamed will be broken by kernel?
> Care to give an example to clarify?
> 
> -Liran

I'll have to get back to you next week when I'm less jetlaged and more
lucid.

> > 
> >>> 
> >>>> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 14:16                                       ` Liran Alon
  2019-03-21 15:15                                         ` Michael S. Tsirkin
@ 2019-03-21 15:15                                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 15:15 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 04:16:14PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:51, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> 
> >>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> >>>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The set of changes I vision to fix our issues are:
> >>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> >>>>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
> >>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> >>>>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> >>>>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -Liran
> >>>>>>>>>>> 
> >>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>>>>>> created by users?
> >>>>>>>>>> 
> >>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
> >>>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
> >>>>>>>>> 
> >>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>>>>>> 
> >>>>>>>> This is also a good idea that will solve the issue. Yes.
> >>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> Does this seems reasonable?
> >>>>>>>>>> 
> >>>>>>>>>> -Liran
> >>>>>>>>> 
> >>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>>>>>> missed a trick or two.
> >>>>>>>> 
> >>>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> >>>>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> >>>>>>>> the breakage we have today for the various userspace components.
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> It seems quite easy to supply that as a module parameter. Do we need two
> >>>>>>> namespaces though? Won't some userspace still be confused by the two
> >>>>>>> slaves sharing the MAC address?
> >>>>>> 
> >>>>>> That’s one reasonable option.
> >>>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> >>>>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> >>>>>> 
> >>>>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> OK. Now what happens if master is moved to another namespace? Do we need
> >>>>> to move the slaves too?
> >>>> 
> >>>> No. Why would we move the slaves?
> >>> 
> >>> 
> >>> The reason we have 3 device model at all is so users can fine tune the
> >>> slaves.
> >> 
> >> I Agree.
> >> 
> >>> I don't see why this applies to the root namespace but not
> >>> a container. If it has access to failover it should have access
> >>> to slaves.
> >> 
> >> Oh now I see your point. I haven’t thought about the containers usage.
> >> My thinking was that customer can always just enter to the “hidden” netns and configure there whatever he wants.
> >> 
> >> Do you have a suggestion how to handle this?
> >> 
> >> One option can be that every "visible" netns on system will have a “hidden” unnamed netns where the net-failover slaves reside in.
> >> If customer wishes to be able to enter to that netns and manage the net-failover slaves explicitly, it will need to have an updated iproute2
> >> that knows how to enter to that hidden netns. For most customers, they won’t need to ever enter that netns and thus it is ok they don’t
> >> have this updated iproute2.
> > 
> > Right so slaves need to be moved whenever master is moved.
> > 
> > Given the amount of mess involved, should we just teach
> > userspace to create the hidden netns and move slaves there?
> 
> That’s a good question.
> 
> However, I believe that it is easier and more suitable to happen in kernel. This is because:
> 1) Implementation is generic across all various distros.
> 2) We seem to discover more and more issues with userspace as we keep testing this on various distros, configurations and workloads.
> 3) It seems weird that kernel does some things automagically and some things don’t. i.e. Kernel automatically binds the virtio-net and VF to net-failover master
> and automatically opens the net-failover slave when the net-failover master is opened, but it doesn’t care about the consequences these actions have on userspace.
> Therefore, I propose let’s go “all in”: Kernel should also be responsible for hiding it’s artefacts unless customer userspace explicitly wants to view and manipulate them.

Just a minor point: failover device is an artefact of kernel. Standy and
primary devices are created by the hypervisor.

> > 
> >>> 
> >>>> The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> >>> 
> >>> So that makes the common case easy. That is good. My worry is it might
> >>> make some uncommon cases impossible.
> >>> 
> >>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >>>> 
> >>>>> 
> >>>>> Also siwei's patch is then kind of extraneous right?
> >>>>> Attempts to rename a slave will now fail as it's in a namespace…
> >>>> 
> >>>> I’m not sure actually. Isn't udev/systemd netns-aware?
> >>>> I would expect it to be able to provide names also to netdevs in netns different than default netns.
> >>> 
> >>> I think most people move devices after they are renamed.
> >> 
> >> So?
> >> Si-Wei patch handles the issue that resolves from the fact the net-failover master will be opened before the rename on the net-failover slaves occur.
> >> This should happen (to my understanding) regardless of network namespaces.
> >> 
> >> -Liran
> > 
> > My point was that any tool that moves devices after they
> > are renamed will be broken by kernel automatically putting
> > them in a namespace.
> 
> I’m not sure I follow. How is this related to Si-Wei patch?
> Si-Wei patch (and the root-cause that leads to the issue it fixes) have nothing to do with network namespaces.
> 
> What do you mean tool that moves devices after they are renamed will be broken by kernel?
> Care to give an example to clarify?
> 
> -Liran

I'll have to get back to you next week when I'm less jetlaged and more
lucid.

> > 
> >>> 
> >>>> If that’s the case, Si-Wei patch to be able to rename a net-failover slave when it is already open is still required. As the race-condition still exists.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:57                             ` Michael S. Tsirkin
  2019-03-21 13:04                               ` Liran Alon
  2019-03-21 13:04                               ` Liran Alon
@ 2019-03-21 15:44                               ` Stephen Hemminger
  2019-03-21 22:33                                 ` si-wei liu
  2019-03-21 15:44                               ` Stephen Hemminger
  3 siblings, 1 reply; 62+ messages in thread
From: Stephen Hemminger @ 2019-03-21 15:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Liran Alon, Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
	Jakub Kicinski, Jiri Pirko, David Miller, Netdev, virtualization,
	boris.ostrovsky, vijay.balakrishna, jfreimann, ogerlitz, vuhuong

On Thu, 21 Mar 2019 08:57:03 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> > 
> >   
> > > On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 
> > > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:  
> > >>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> > >>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> > >>>>>> 
> > >>>>>> The set of changes I vision to fix our issues are:
> > >>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> > >>>>>> (E.g. Configure the net-failover VF slave in some special way).
> > >>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> > >>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> > >>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> > >>>>>> 
> > >>>>>> Is there any clear issue we see regarding the above suggestion?
> > >>>>>> 
> > >>>>>> -Liran  
> > >>>>> 
> > >>>>> The issue would be this: how do we avoid conflicting with namespaces
> > >>>>> created by users?  
> > >>>> 
> > >>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> > >>>> To reference a hidden netns, you need to do it explicitly. 
> > >>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).  
> > >>> 
> > >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?  
> > >> 
> > >> This is also a good idea that will solve the issue. Yes.
> > >>   
> > >>>   
> > >>>> Does this seems reasonable?
> > >>>> 
> > >>>> -Liran  
> > >>> 
> > >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> > >>> missed a trick or two.  
> > >> 
> > >> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> > >> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> > >> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> > >> the breakage we have today for the various userspace components.
> > >> 
> > >> -Liran  
> > > 
> > > It seems quite easy to supply that as a module parameter. Do we need two
> > > namespaces though? Won't some userspace still be confused by the two
> > > slaves sharing the MAC address?  
> > 
> > That’s one reasonable option.
> > Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> > i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> > 
> > The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> > 
> > -Liran  
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?
> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace...

I did try moving slave device into a namespace at one point.
The problem is that introduces all sorts of locking problems in the code
because you can't do it directly in the context of when the callback
happens that a new slave device is discovered.

Since you can't safely change device namespace in the notifier,
it requires a work queue. Then you add more complexity and error cases
because the slave is exposed for a short period, and handling all the
state race unwinds...

Good idea but hard to implement

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 12:57                             ` Michael S. Tsirkin
                                                 ` (2 preceding siblings ...)
  2019-03-21 15:44                               ` Stephen Hemminger
@ 2019-03-21 15:44                               ` Stephen Hemminger
  3 siblings, 0 replies; 62+ messages in thread
From: Stephen Hemminger @ 2019-03-21 15:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Liran Alon, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, 21 Mar 2019 08:57:03 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> > 
> >   
> > > On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 
> > > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:  
> > >>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
> > >>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> > >>>>>> 
> > >>>>>> The set of changes I vision to fix our issues are:
> > >>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
> > >>>>>> (E.g. Configure the net-failover VF slave in some special way).
> > >>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> > >>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
> > >>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
> > >>>>>> 
> > >>>>>> Is there any clear issue we see regarding the above suggestion?
> > >>>>>> 
> > >>>>>> -Liran  
> > >>>>> 
> > >>>>> The issue would be this: how do we avoid conflicting with namespaces
> > >>>>> created by users?  
> > >>>> 
> > >>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> > >>>> To reference a hidden netns, you need to do it explicitly. 
> > >>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).  
> > >>> 
> > >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?  
> > >> 
> > >> This is also a good idea that will solve the issue. Yes.
> > >>   
> > >>>   
> > >>>> Does this seems reasonable?
> > >>>> 
> > >>>> -Liran  
> > >>> 
> > >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> > >>> missed a trick or two.  
> > >> 
> > >> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> > >> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
> > >> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
> > >> the breakage we have today for the various userspace components.
> > >> 
> > >> -Liran  
> > > 
> > > It seems quite easy to supply that as a module parameter. Do we need two
> > > namespaces though? Won't some userspace still be confused by the two
> > > slaves sharing the MAC address?  
> > 
> > That’s one reasonable option.
> > Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
> > i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
> > 
> > The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
> > 
> > -Liran  
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?
> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace...

I did try moving slave device into a namespace at one point.
The problem is that introduces all sorts of locking problems in the code
because you can't do it directly in the context of when the callback
happens that a new slave device is discovered.

Since you can't safely change device namespace in the notifier,
it requires a work queue. Then you add more complexity and error cases
because the slave is exposed for a short period, and handling all the
state race unwinds...

Good idea but hard to implement
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:04                               ` Liran Alon
                                                   ` (2 preceding siblings ...)
  2019-03-21 15:45                                 ` Stephen Hemminger
@ 2019-03-21 15:45                                 ` Stephen Hemminger
  2019-03-21 15:50                                   ` Michael S. Tsirkin
  2019-03-21 15:50                                   ` Michael S. Tsirkin
  3 siblings, 2 replies; 62+ messages in thread
From: Stephen Hemminger @ 2019-03-21 15:45 UTC (permalink / raw)
  To: Liran Alon
  Cc: Michael S. Tsirkin, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, 21 Mar 2019 15:04:37 +0200
Liran Alon <liran.alon@oracle.com> wrote:

> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?  
> 
> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.


The 2-device netvsc already handles case where master changes namespace.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 13:04                               ` Liran Alon
  2019-03-21 13:12                                 ` Michael S. Tsirkin
  2019-03-21 13:12                                 ` Michael S. Tsirkin
@ 2019-03-21 15:45                                 ` Stephen Hemminger
  2019-03-21 15:45                                 ` Stephen Hemminger
  3 siblings, 0 replies; 62+ messages in thread
From: Stephen Hemminger @ 2019-03-21 15:45 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski,
	Sridhar Samudrala, Alexander Duyck, virtualization, Netdev,
	Si-Wei Liu, boris.ostrovsky, David Miller, ogerlitz

On Thu, 21 Mar 2019 15:04:37 +0200
Liran Alon <liran.alon@oracle.com> wrote:

> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?  
> 
> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.


The 2-device netvsc already handles case where master changes namespace.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 15:45                                 ` Stephen Hemminger
@ 2019-03-21 15:50                                   ` Michael S. Tsirkin
  2019-03-21 16:31                                     ` Liran Alon
  2019-03-21 16:31                                     ` Liran Alon
  2019-03-21 15:50                                   ` Michael S. Tsirkin
  1 sibling, 2 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 15:50 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Liran Alon, Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
	Jakub Kicinski, Jiri Pirko, David Miller, Netdev, virtualization,
	boris.ostrovsky, vijay.balakrishna, jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
> On Thu, 21 Mar 2019 15:04:37 +0200
> Liran Alon <liran.alon@oracle.com> wrote:
> 
> > > 
> > > OK. Now what happens if master is moved to another namespace? Do we need
> > > to move the slaves too?  
> > 
> > No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> > We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> 
> 
> The 2-device netvsc already handles case where master changes namespace.

Is it by moving slave with it?

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 15:45                                 ` Stephen Hemminger
  2019-03-21 15:50                                   ` Michael S. Tsirkin
@ 2019-03-21 15:50                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 15:50 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Liran Alon, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
> On Thu, 21 Mar 2019 15:04:37 +0200
> Liran Alon <liran.alon@oracle.com> wrote:
> 
> > > 
> > > OK. Now what happens if master is moved to another namespace? Do we need
> > > to move the slaves too?  
> > 
> > No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> > We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> 
> 
> The 2-device netvsc already handles case where master changes namespace.

Is it by moving slave with it?

-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 15:50                                   ` Michael S. Tsirkin
@ 2019-03-21 16:31                                     ` Liran Alon
  2019-03-21 17:12                                         ` Michael S. Tsirkin
  2019-03-21 16:31                                     ` Liran Alon
  1 sibling, 1 reply; 62+ messages in thread
From: Liran Alon @ 2019-03-21 16:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 17:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>> On Thu, 21 Mar 2019 15:04:37 +0200
>> Liran Alon <liran.alon@oracle.com> wrote:
>> 
>>>> 
>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>> to move the slaves too?  
>>> 
>>> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>> 
>> 
>> The 2-device netvsc already handles case where master changes namespace.
> 
> Is it by moving slave with it?

See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
It seems that when NetVSC master netdev changes netns, the VF is moved to the same netns by the NetVSC driver.
Kinda the opposite than what we are suggesting here to make sure that the net-failover master netdev is on a separate
netns than it’s slaves...

-Liran

> 
> -- 
> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 15:50                                   ` Michael S. Tsirkin
  2019-03-21 16:31                                     ` Liran Alon
@ 2019-03-21 16:31                                     ` Liran Alon
  1 sibling, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 16:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 17:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>> On Thu, 21 Mar 2019 15:04:37 +0200
>> Liran Alon <liran.alon@oracle.com> wrote:
>> 
>>>> 
>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>> to move the slaves too?  
>>> 
>>> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>> 
>> 
>> The 2-device netvsc already handles case where master changes namespace.
> 
> Is it by moving slave with it?

See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
It seems that when NetVSC master netdev changes netns, the VF is moved to the same netns by the NetVSC driver.
Kinda the opposite than what we are suggesting here to make sure that the net-failover master netdev is on a separate
netns than it’s slaves...

-Liran

> 
> -- 
> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 16:31                                     ` Liran Alon
@ 2019-03-21 17:12                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 17:12 UTC (permalink / raw)
  To: Liran Alon
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong

On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 17:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
> >> On Thu, 21 Mar 2019 15:04:37 +0200
> >> Liran Alon <liran.alon@oracle.com> wrote:
> >> 
> >>>> 
> >>>> OK. Now what happens if master is moved to another namespace? Do we need
> >>>> to move the slaves too?  
> >>> 
> >>> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> >>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >> 
> >> 
> >> The 2-device netvsc already handles case where master changes namespace.
> > 
> > Is it by moving slave with it?
> 
> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
> It seems that when NetVSC master netdev changes netns, the VF is moved to the same netns by the NetVSC driver.
> Kinda the opposite than what we are suggesting here to make sure that the net-failover master netdev is on a separate
> netns than it’s slaves...
> 
> -Liran
> 
> > 
> > -- 
> > MST

Not exactly opposite I'd say.

If failover is in host ns, slaves in /primary and /standby, then moving
failover to /container should move slaves to /container/primary and
/container/standby.


-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
@ 2019-03-21 17:12                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-21 17:12 UTC (permalink / raw)
  To: Liran Alon
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz

On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 17:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
> >> On Thu, 21 Mar 2019 15:04:37 +0200
> >> Liran Alon <liran.alon@oracle.com> wrote:
> >> 
> >>>> 
> >>>> OK. Now what happens if master is moved to another namespace? Do we need
> >>>> to move the slaves too?  
> >>> 
> >>> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
> >>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >> 
> >> 
> >> The 2-device netvsc already handles case where master changes namespace.
> > 
> > Is it by moving slave with it?
> 
> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
> It seems that when NetVSC master netdev changes netns, the VF is moved to the same netns by the NetVSC driver.
> Kinda the opposite than what we are suggesting here to make sure that the net-failover master netdev is on a separate
> netns than it’s slaves...
> 
> -Liran
> 
> > 
> > -- 
> > MST

Not exactly opposite I'd say.

If failover is in host ns, slaves in /primary and /standby, then moving
failover to /container should move slaves to /container/primary and
/container/standby.


-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 17:12                                         ` Michael S. Tsirkin
  (?)
  (?)
@ 2019-03-21 17:15                                         ` Liran Alon
  -1 siblings, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 17:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala,
	Alexander Duyck, Jakub Kicinski, Jiri Pirko, David Miller,
	Netdev, virtualization, boris.ostrovsky, vijay.balakrishna,
	jfreimann, ogerlitz, vuhuong



> On 21 Mar 2019, at 19:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 17:50, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>>>> On Thu, 21 Mar 2019 15:04:37 +0200
>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>> 
>>>>>> 
>>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>>> to move the slaves too?  
>>>>> 
>>>>> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
>>>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>>>> 
>>>> 
>>>> The 2-device netvsc already handles case where master changes namespace.
>>> 
>>> Is it by moving slave with it?
>> 
>> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
>> It seems that when NetVSC master netdev changes netns, the VF is moved to the same netns by the NetVSC driver.
>> Kinda the opposite than what we are suggesting here to make sure that the net-failover master netdev is on a separate
>> netns than it’s slaves...
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST
> 
> Not exactly opposite I'd say.
> 
> If failover is in host ns, slaves in /primary and /standby, then moving
> failover to /container should move slaves to /container/primary and
> /container/standby.

Yes I agree.
I meant that they tried to keep the VF on the same netns as the NetVSC.
But of course what you just described is exactly the functionality I would have wanted in our net-failover mechanism.

-Liran

> 
> 
> -- 
> MST


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 17:12                                         ` Michael S. Tsirkin
  (?)
@ 2019-03-21 17:15                                         ` Liran Alon
  -1 siblings, 0 replies; 62+ messages in thread
From: Liran Alon @ 2019-03-21 17:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, Netdev, Si-Wei Liu,
	boris.ostrovsky, David Miller, ogerlitz



> On 21 Mar 2019, at 19:12, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 17:50, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>>>> On Thu, 21 Mar 2019 15:04:37 +0200
>>>> Liran Alon <liran.alon@oracle.com> wrote:
>>>> 
>>>>>> 
>>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>>> to move the slaves too?  
>>>>> 
>>>>> No. Why would we move the slaves? The whole point is to make most customer ignore the net-failover slaves and remain them “hidden” in their dedicated netns.
>>>>> We won’t prevent customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>>>> 
>>>> 
>>>> The 2-device netvsc already handles case where master changes namespace.
>>> 
>>> Is it by moving slave with it?
>> 
>> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
>> It seems that when NetVSC master netdev changes netns, the VF is moved to the same netns by the NetVSC driver.
>> Kinda the opposite than what we are suggesting here to make sure that the net-failover master netdev is on a separate
>> netns than it’s slaves...
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST
> 
> Not exactly opposite I'd say.
> 
> If failover is in host ns, slaves in /primary and /standby, then moving
> failover to /container should move slaves to /container/primary and
> /container/standby.

Yes I agree.
I meant that they tried to keep the VF on the same netns as the NetVSC.
But of course what you just described is exactly the functionality I would have wanted in our net-failover mechanism.

-Liran

> 
> 
> -- 
> MST

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [summary] virtio network device failover writeup
  2019-03-21 15:44                               ` Stephen Hemminger
@ 2019-03-21 22:33                                 ` si-wei liu
  0 siblings, 0 replies; 62+ messages in thread
From: si-wei liu @ 2019-03-21 22:33 UTC (permalink / raw)
  To: Stephen Hemminger, Michael S. Tsirkin
  Cc: Liran Alon, Sridhar Samudrala, Alexander Duyck, Jakub Kicinski,
	Jiri Pirko, David Miller, Netdev, virtualization,
	boris.ostrovsky, vijay.balakrishna, jfreimann, ogerlitz, vuhuong



On 3/21/2019 8:44 AM, Stephen Hemminger wrote:
> On Thu, 21 Mar 2019 08:57:03 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>    
>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse connectivity issue by checking the connectivity
>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity when in-fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>>>
>>>>>>>>> The set of changes I vision to fix our issues are:
>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. But that user can enter to it and manage the netdevs there if wishes to do so explicitly.
>>>>>>>>> (E.g. Configure the net-failover VF slave in some special way).
>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>>>>>>>>> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa. Instead of relying on intercepting the PCI master enable-bit
>>>>>>>>> as an indicator on when VF is about to be set up. (Similar to as done in NetVSC).
>>>>>>>>>
>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>
>>>>>>>>> -Liran
>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>> created by users?
>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>>> To reference a hidden netns, you need to do it explicitly.
>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (Yes I’m overloading the term namespace here…).
>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>    
>>>>>>    
>>>>>>> Does this seems reasonable?
>>>>>>>
>>>>>>> -Liran
>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>> missed a trick or two.
>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>>> it was better to create an kernel auto-generated name (e.g. “kernel_net_failover_slaves")
>>>>> that will break only userspace workloads that by a very rare-chance have a netns that collides with this then
>>>>> the breakage we have today for the various userspace components.
>>>>>
>>>>> -Liran
>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>> namespaces though? Won't some userspace still be confused by the two
>>>> slaves sharing the MAC address?
>>> That’s one reasonable option.
>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device.
>>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to be bonded with.
>>>
>>> The second seems cleaner but I don’t have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from current state of things.
>>>
>>> -Liran
>> OK. Now what happens if master is moved to another namespace? Do we need
>> to move the slaves too?
>>
>> Also siwei's patch is then kind of extraneous right?
>> Attempts to rename a slave will now fail as it's in a namespace...
> I did try moving slave device into a namespace at one point.
> The problem is that introduces all sorts of locking problems in the code
> because you can't do it directly in the context of when the callback
> happens that a new slave device is discovered.
>
> Since you can't safely change device namespace in the notifier,
> it requires a work queue. Then you add more complexity and error cases
> because the slave is exposed for a short period, and handling all the
> state race unwinds...
Thanks for your input, that's why I never got it started on the 
implementation before getting consensus here. I think we need to put the 
slave into kernel created netns otherwise it suffers from various 
userspace races. Userspace tool (such as udevd and ip) needs to 
specifically subscribe to events in those kernel created netns for 
rename and config. Locking still complicated, though there should be one 
way or another to work out.

-Siwei



>
> Good idea but hard to implement



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [summary] virtio network device failover writeup
@ 2019-03-17 13:55 Michael S. Tsirkin
  0 siblings, 0 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2019-03-17 13:55 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: vuhuong, Jiri Pirko, Jakub Kicinski, Sridhar Samudrala,
	Alexander Duyck, virtualization, liran.alon, Netdev,
	boris.ostrovsky, David Miller, ogerlitz

Hi all,
I've put up a blog post with a summary of where network
device failover stands and some open issues.
Not sure where best to host it, I just put it up on blogspot:
https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html

Comments, corrections are welcome!

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2019-03-21 22:32 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-17 13:55 [summary] virtio network device failover writeup Michael S. Tsirkin
2019-03-19 12:38 ` Liran Alon
2019-03-19 12:38 ` Liran Alon
2019-03-19 15:46   ` Stephen Hemminger
2019-03-19 15:46   ` Stephen Hemminger
2019-03-19 21:19     ` Michael S. Tsirkin
2019-03-19 21:19     ` Michael S. Tsirkin
2019-03-19 23:25       ` Liran Alon
2019-03-20 10:25         ` Michael S. Tsirkin
2019-03-20 10:25         ` Michael S. Tsirkin
2019-03-20 12:23           ` Liran Alon
2019-03-20 14:09             ` Michael S. Tsirkin
2019-03-20 14:09             ` Michael S. Tsirkin
2019-03-20 21:43               ` Liran Alon
2019-03-20 21:43               ` Liran Alon
2019-03-20 22:10                 ` Michael S. Tsirkin
2019-03-20 22:10                 ` Michael S. Tsirkin
2019-03-20 22:19                   ` Liran Alon
2019-03-21  8:58                     ` Michael S. Tsirkin
2019-03-21  8:58                     ` Michael S. Tsirkin
2019-03-21 10:07                       ` Liran Alon
2019-03-21 12:37                         ` Michael S. Tsirkin
2019-03-21 12:37                         ` Michael S. Tsirkin
2019-03-21 12:47                           ` Liran Alon
2019-03-21 12:57                             ` Michael S. Tsirkin
2019-03-21 12:57                             ` Michael S. Tsirkin
2019-03-21 13:04                               ` Liran Alon
2019-03-21 13:04                               ` Liran Alon
2019-03-21 13:12                                 ` Michael S. Tsirkin
2019-03-21 13:12                                 ` Michael S. Tsirkin
2019-03-21 13:24                                   ` Liran Alon
2019-03-21 13:24                                   ` Liran Alon
2019-03-21 13:51                                     ` Michael S. Tsirkin
2019-03-21 13:51                                     ` Michael S. Tsirkin
2019-03-21 14:16                                       ` Liran Alon
2019-03-21 15:15                                         ` Michael S. Tsirkin
2019-03-21 15:15                                         ` Michael S. Tsirkin
2019-03-21 14:16                                       ` Liran Alon
2019-03-21 15:45                                 ` Stephen Hemminger
2019-03-21 15:45                                 ` Stephen Hemminger
2019-03-21 15:50                                   ` Michael S. Tsirkin
2019-03-21 16:31                                     ` Liran Alon
2019-03-21 17:12                                       ` Michael S. Tsirkin
2019-03-21 17:12                                         ` Michael S. Tsirkin
2019-03-21 17:15                                         ` Liran Alon
2019-03-21 17:15                                         ` Liran Alon
2019-03-21 16:31                                     ` Liran Alon
2019-03-21 15:50                                   ` Michael S. Tsirkin
2019-03-21 15:44                               ` Stephen Hemminger
2019-03-21 22:33                                 ` si-wei liu
2019-03-21 15:44                               ` Stephen Hemminger
2019-03-21 12:47                           ` Liran Alon
2019-03-21 10:07                       ` Liran Alon
2019-03-20 22:19                   ` Liran Alon
2019-03-20 12:23           ` Liran Alon
2019-03-19 23:25       ` Liran Alon
2019-03-19 21:06   ` Michael S. Tsirkin
2019-03-19 23:05     ` Liran Alon
2019-03-19 23:05     ` Liran Alon
2019-03-19 21:06   ` Michael S. Tsirkin
2019-03-19 21:55   ` si-wei liu
2019-03-17 13:55 Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.