From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=ar10=RD=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-4.1 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,
	SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 10763C43381
	for <netdev@archiver.kernel.org>; Thu, 28 Feb 2019 09:32:44 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id BF7CC218AE
	for <netdev@archiver.kernel.org>; Thu, 28 Feb 2019 09:32:43 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="UV895fZe"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1731367AbfB1Jcm (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Thu, 28 Feb 2019 04:32:42 -0500
Received: from userp2130.oracle.com ([156.151.31.86]:36526 "EHLO
        userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1725978AbfB1Jcl (ORCPT
        <rfc822;netdev@vger.kernel.org>); Thu, 28 Feb 2019 04:32:41 -0500
Received: from pps.filterd (userp2130.oracle.com [127.0.0.1])
        by userp2130.oracle.com (8.16.0.27/8.16.0.27) with SMTP id x1S9OFoH044291;
        Thu, 28 Feb 2019 09:32:25 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=subject : to :
 references : cc : from : message-id : date : mime-version : in-reply-to :
 content-type : content-transfer-encoding; s=corp-2018-07-02;
 bh=pQLaNj4mIYhPh/gfjt8Ef4P42v37cH1AQ+sSMuE/Ugo=;
 b=UV895fZeoDeG9N3rLqjoq6JsvAoxbneFeCCXufbULcbdWeHI+Ub/YQU3bdZbOwb/iFXl
 inMI0eVgd+1DNz7Q1DbYPSNVOGF0WcOWk07JC2B0YLcJWQdYqr4fv6MX/8inFBS6foNH
 ZcnNeZ6XUPkjG9+RJnKXZAWLDy9j6LzmHtORbI+l8HZWvIdfdOxL7Hw3uIC8aBhmXnct
 AMHVngeYcBmjm3jRESvai5bBNsrKyrsY9yPrMo+SZW7MymP6T2OJ5hAWnR0K0gIfbigR
 0HpbliyR+6HujJuRSlIYLxBd1ftyUpe4K8q4Zr2PSw6Fw4jreZI4UI6ot9pRGvp2lH26 qA== 
Received: from userv0022.oracle.com (userv0022.oracle.com [156.151.31.74])
        by userp2130.oracle.com with ESMTP id 2qtwkufutq-1
        (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
        Thu, 28 Feb 2019 09:32:24 +0000
Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72])
        by userv0022.oracle.com (8.14.4/8.14.4) with ESMTP id x1S9WJBi027853
        (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
        Thu, 28 Feb 2019 09:32:19 GMT
Received: from abhmp0007.oracle.com (abhmp0007.oracle.com [141.146.116.13])
        by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id x1S9WHrZ029287;
        Thu, 28 Feb 2019 09:32:18 GMT
Received: from [10.159.225.38] (/10.159.225.38)
        by default (Oracle Beehive Gateway v4.0)
        with ESMTP ; Thu, 28 Feb 2019 01:32:17 -0800
Subject: Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC
 PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use
 the bypass framework)
To:     "Michael S. Tsirkin" <mst@redhat.com>
References: <91d4cbb1-be7a-b53c-6b2a-99bef07e7c53@intel.com>
 <d9ef40a2-237b-0cce-4401-ecaeac4c602a@oracle.com>
 <20190222100753-mutt-send-email-mst@kernel.org>
 <e6a53bd1-83ab-f170-406a-03276e8c87e2@oracle.com>
 <20190225210529-mutt-send-email-mst@kernel.org>
 <d1060c75-eaba-ab6f-ff31-38cb3a47c711@oracle.com>
 <20190227173710-mutt-send-email-mst@kernel.org>
 <c72ce9eb-254c-cc3e-1969-f7f108506d5e@oracle.com>
 <20190227184601-mutt-send-email-mst@kernel.org>
 <a617ce13-4114-469d-ef33-a1c91150eeca@oracle.com>
 <20190227193923-mutt-send-email-mst@kernel.org>
Cc:     "Samudrala, Sridhar" <sridhar.samudrala@intel.com>,
        Siwei Liu <loseweigh@gmail.com>, Jiri Pirko <jiri@resnulli.us>,
        Stephen Hemminger <stephen@networkplumber.org>,
        David Miller <davem@davemloft.net>,
        Netdev <netdev@vger.kernel.org>,
        virtualization@lists.linux-foundation.org,
        virtio-dev <virtio-dev@lists.oasis-open.org>,
        "Brandeburg, Jesse" <jesse.brandeburg@intel.com>,
        Alexander Duyck <alexander.h.duyck@intel.com>,
        Jakub Kicinski <kubakici@wp.pl>,
        Jason Wang <jasowang@redhat.com>, liran.alon@oracle.com
From:   si-wei liu <si-wei.liu@oracle.com>
Organization: Oracle Corporation
Message-ID: <36901346-e3d5-4e51-6a8d-678eb5b9e352@oracle.com>
Date:   Thu, 28 Feb 2019 01:32:12 -0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <20190227193923-mutt-send-email-mst@kernel.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9180 signatures=668685
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0
 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015
 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000
 definitions=main-1902280067
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org


On 2/27/2019 4:41 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
>>
>> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
>>> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>>>> and provides no solution if the user cares about consistent naming
>>>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>>>> Above says:
>>>>>>>>>>>>>
>>>>>>>>>>>>>          there's no motivation in the systemd/udevd community at
>>>>>>>>>>>>>          this point to refactor the rename logic and make it work well with
>>>>>>>>>>>>>          3-netdev.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>>>
>>>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>>>> solved...
>>>>>>>>> I was just wondering what did you mean when you said
>>>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>>>> was there a proposal udev rejected?
>>>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>>>> the feasibility...
>>>>>>>>
>>>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>>>> into, or add new ones.
>>>>>>>>>
>>>>>>>> See attached diagram.
>>>>>>>>
>>>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>>>> to only work with the master failover device.
>>>>>>>>>> Where does this expectation come from?
>>>>>>>>>>
>>>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>>>
>>>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>>>> migration.
>>>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>>>
>>>>>>>>> BTW this is something we should look at IMHO.
>>>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>>>> the very beginning.
>>>>>>>>
>>>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>>>
>>>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>> Your scripts would not work at all then, right?
>>>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>>>> migrate-able. We would not flag them as live migrate-able until
>>>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>>>> emerges in upstream eventually.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>>>>>
>>>>>>>> net_failover(kernel)                               | network.service (user)       | systemd-udevd (user)
>>>>>>>> ---------------------------------------------------+------------------------------+-------------------------------------
>>>>>>>> (standby virtio-net and net_failover               |                              |
>>>>>>>> devices created and initialized,                   |                              |
>>>>>>>> i.e. virtnet_probe()->                             |                              |
>>>>>>>>      net_failover_create()                         |                              |
>>>>>>>> was done.)                                         |                              |
>>>>>>>>                                                    |                              |
>>>>>>>>                                                    | runs `ifup ens3' ->          |
>>>>>>>>                                                    |   ip link set dev ens3 up    |
>>>>>>>> net_failover_open()                                |                              |
>>>>>>>>   dev_open(virtnet_dev)                            |                              |
>>>>>>>>     virtnet_open(virtnet_dev)                      |                              |
>>>>>>>>   netif_carrier_on(failover_dev)                   |                              |
>>>>>>>>   ...                                              |                              |
>>>>>>>>                                                    |                              |
>>>>>>>> (VF hot plugged in)                                |                              |
>>>>>>>> ixgbevf_probe()                                    |                              |
>>>>>>>>   register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>>>    netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>>>     kobject_add(ixgbevf_dev)                       |                              |
>>>>>>>>      device_add(ixgbevf_dev)                       |                              |
>>>>>>>>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>>>        netlink_broadcast()                         |                              |
>>>>>>>>   ...                                              |                              |
>>>>>>>>   call_netdevice_notifiers(NETDEV_REGISTER)        |                              |
>>>>>>>>    failover_event(..., NETDEV_REGISTER, ...)       |                              |
>>>>>>>>     failover_slave_register(ixgbevf_netdev)        |                              |
>>>>>>>>      net_failover_slave_register(ixgbevf_netdev)   |                              |
>>>>>>>>       dev_open(ixgbevf_netdev)                     |                              |
>>>>>>>>                                                    |                              |
>>>>>>>>                                                    |                              |
>>>>>>>>                                                    |                              | received ADD uevent from netlink fd
>>>>>>>>                                                    |                              | ...
>>>>>>>>                                                    |                              | udev-builtin-net_id.c:dev_pci_slot()
>>>>>>>>                                                    |                              | (decided to rename 'eth0')
>>>>>>>>                                                    |                              |   ip link set dev eth0 name ens4
>>>>>>>> (dev_change_name() returns -EBUSY as               |                              |
>>>>>>>> ixgbevf_netdev->flags has IFF_UP)                  |                              |
>>>>>>>>                                                    |                              |
>>>>>>>>
>>>>>>> Given renaming slaves does not work anyway:
>>>>>> I was actually thinking what if we relieve the rename restriction just for
>>>>>> the failover slave? What the impact would be? I think users don't care about
>>>>>> slave being renamed when it's in use, especially the initial rename.
>>>>>> Thoughts?
>>>>>>
>>>>>>>      would it work if we just
>>>>>>> hard-coded slave names instead?
>>>>>>>
>>>>>>> E.g.
>>>>>>> 1. fail slave renames
>>>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>>>        and primary to XXnpry
>>>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>>>> may not be present.
>>>>> In this scheme if VF is not there it will be renamed immediately after registration.
>>>> Who will be responsible to rename the slave, the kernel?
>>> That's the idea.
>>>
>>>> Note the master's
>>>> name may or may not come from the userspace. If it comes from the userspace,
>>>> should the userspace daemon change their expectation not to name/rename
>>>> _any_ slaves (today there's no distinction)?
>>> Yes the idea would be to fail renaming slaves.
>> No I was asking about the userspace expectation: whether it should track and
>> detect the lifecycle events of failover slaves and decide what to do. How
>> does it get back to the user specified name if VF is not enslaved (say
>> someone unloads the virtio-net module)?
> When virtio net is removed VF will shortly be removed too.
>
>> As this scheme adds much complexity to the kernel naming convention
>> (currently it's just ethX names) that no userspace can understand.
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.
>
>> Will the
>> change break userspace further?
>>
>> -Siwei
> Didn't you show userspace is already broken. You can't "further
> break it", rename already fails.
It's a race: userspace tries to give the slave the name it wants, but the
rename sometimes fails because of this race. Today, if the failover master
is not up, the rename succeeds anyway. What you proposed, if I understand
you correctly, prohibits the user from providing a name under all
circumstances. That's what I meant by breaking userspace further. On the
other hand, you seem to tie the kernel's default naming to udev
predictable names, which come only from recent systemd-udevd, while many
other userspace naming schemes exist. Users who deliberately disable
predictable naming today (net.ifnames=0 biosdevname=0) and fall back to
kernel-provided names expect the ethX pattern; with this change,
admin/user scripts that match the ethX pattern could break.
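
To make the ethX concern concrete, here is a minimal userspace sketch (not
kernel code). The helper names derive_slave_name() and matches_ethx() are
hypothetical, and the "nsby"/"npry" suffixes are taken from the naming
scheme quoted in this thread. It only illustrates that a script matching
the ethX pattern would stop matching a slave named this way:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds the slave name described in the quoted
 * proposal, i.e. the failover name plus "npry" (primary) or "nsby"
 * (standby). Returns 0 on success, -1 if the result does not fit. */
static int derive_slave_name(const char *master, int is_primary,
                             char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s%s", master,
                     is_primary ? "npry" : "nsby");

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: returns 1 if name is "eth" followed by one or
 * more decimal digits, which is what an ethX-matching script expects. */
static int matches_ethx(const char *name)
{
    const char *p;

    if (strncmp(name, "eth", 3) != 0 || name[3] == '\0')
        return 0;
    for (p = name + 3; *p; p++)
        if (!isdigit((unsigned char)*p))
            return 0;
    return 1;
}
```

For a failover named ens3 the helper yields ens3nsby and ens3npry; neither
passes matches_ethx(), while eth0 does. With a 16-byte buffer (the size of
IFNAMSIZ), a long master name can also leave no room for the suffix.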

IMHO that naming change is riskier than allowing userspace to rename the
failover slave in all cases. Let me remind everyone that net_failover
targets the live migration scenario specifically; its users typically
lack the in-depth knowledge to fiddle with the low-level plumbing and
just expect to operate on the master device directly. I am not much
concerned about broken netfilter rules on the slave, or the like, if we
just lift the rename restriction: failover slave naming is already
unreliable, so how could we further break apps that rely on consistent
naming without fixing it in the first place? The fix could be as simple
as a two-line code change. Any net_failover user who might break due to
this change would already have come here and complained about the naming
issue. IOW, at the very least, the change below shouldn't make the
current situation any worse.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1127,7 +1127,8 @@ int dev_change_name(struct net_device *dev, const char *newname)
         BUG_ON(!dev_net(dev));

         net = dev_net(dev);
-       if (dev->flags & IFF_UP)
+       if (dev->flags & IFF_UP &&
+           !(dev->priv_flags & IFF_FAILOVER_SLAVE))
                 return -EBUSY;

         write_seqcount_begin(&devnet_rename_seq);
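
For illustration only, the effect of the check can be modeled in
userspace. The sketch below is not the kernel function: struct
model_netdev, model_change_name() and the MODEL_* flag values are made up
for this example, and everything dev_change_name() does besides the busy
check is reduced to a length test and a string copy.

```c
#include <errno.h>
#include <string.h>

/* Illustrative values only; the kernel defines the real flags. */
#define MODEL_IFF_UP             0x1u        /* lives in ->flags */
#define MODEL_IFF_FAILOVER_SLAVE (1u << 4)   /* lives in ->priv_flags */

struct model_netdev {
    char name[16];              /* same size as IFNAMSIZ */
    unsigned int flags;
    unsigned int priv_flags;
};

/* Models only the busy check of a rename. patched == 0 mimics the check
 * before the change, patched != 0 mimics it with failover slaves exempted. */
static int model_change_name(struct model_netdev *dev, const char *newname,
                             int patched)
{
    int exempt = patched && (dev->priv_flags & MODEL_IFF_FAILOVER_SLAVE);

    if ((dev->flags & MODEL_IFF_UP) && !exempt)
        return -EBUSY;          /* device is up and not exempted */
    if (strlen(newname) >= sizeof(dev->name))
        return -EINVAL;         /* name would not fit */
    strcpy(dev->name, newname);
    return 0;
}
```

With patched == 0, a VF that is up and enslaved gets -EBUSY on rename.
With patched != 0 the same rename succeeds, while a device that is up but
is not a failover slave is still refused.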

Thanks,
-Siwei


>
>>>> How do users know which name to
>>>> trust, depending on which wins the race more often? Say if kernel wants a
>>>> ens3npry name while userspace wants it named as ens4.
>>>>
>>>> -Siwei
>>> With this approach kernel will deny attempts by userspace to rename
>>> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
>>> will rename both slaves.
>>>
>>> It seems pretty solid to me, the only issue is that in theory userspace
>>> can use a name like XXXnsby for something else. But this seems unlikely.
>>>
>>>
>>>>>> I don't like the idea to delay exposing failover master
>>>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> I agree, this was not what I meant.
>>>>>


From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: virtio-dev-return-5554-cohuck=redhat.com@lists.oasis-open.org
Sender: <virtio-dev@lists.oasis-open.org>
List-Post: <mailto:virtio-dev@lists.oasis-open.org>
List-Help: <mailto:virtio-dev-help@lists.oasis-open.org>
List-Unsubscribe: <mailto:virtio-dev-unsubscribe@lists.oasis-open.org>
List-Subscribe: <mailto:virtio-dev-subscribe@lists.oasis-open.org>
Received: from lists.oasis-open.org (oasis.ws5.connectedcommunity.org [10.110.1.242])
	by lists.oasis-open.org (Postfix) with ESMTP id F037E985A9B
	for <virtio-dev@lists.oasis-open.org>; Thu, 28 Feb 2019 09:32:33 +0000 (UTC)
References: <91d4cbb1-be7a-b53c-6b2a-99bef07e7c53@intel.com>
 <d9ef40a2-237b-0cce-4401-ecaeac4c602a@oracle.com>
 <20190222100753-mutt-send-email-mst@kernel.org>
 <e6a53bd1-83ab-f170-406a-03276e8c87e2@oracle.com>
 <20190225210529-mutt-send-email-mst@kernel.org>
 <d1060c75-eaba-ab6f-ff31-38cb3a47c711@oracle.com>
 <20190227173710-mutt-send-email-mst@kernel.org>
 <c72ce9eb-254c-cc3e-1969-f7f108506d5e@oracle.com>
 <20190227184601-mutt-send-email-mst@kernel.org>
 <a617ce13-4114-469d-ef33-a1c91150eeca@oracle.com>
 <20190227193923-mutt-send-email-mst@kernel.org>
From: si-wei liu <si-wei.liu@oracle.com>
Message-ID: <36901346-e3d5-4e51-6a8d-678eb5b9e352@oracle.com>
Date: Thu, 28 Feb 2019 01:32:12 -0800
MIME-Version: 1.0
In-Reply-To: <20190227193923-mutt-send-email-mst@kernel.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC
 PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use
 the bypass framework)
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>, Siwei Liu <loseweigh@gmail.com>, Jiri Pirko <jiri@resnulli.us>, Stephen Hemminger <stephen@networkplumber.org>, David Miller <davem@davemloft.net>, Netdev <netdev@vger.kernel.org>, virtualization@lists.linux-foundation.org, virtio-dev <virtio-dev@lists.oasis-open.org>, "Brandeburg, Jesse" <jesse.brandeburg@intel.com>, Alexander Duyck <alexander.h.duyck@intel.com>, Jakub Kicinski <kubakici@wp.pl>, Jason Wang <jasowang@redhat.com>, liran.alon@oracle.com
List-ID: <virtio-dev.lists.oasis-open.org>


On 2/27/2019 4:41 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
>>
>> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
>>> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>>>> and provides no solution if the user cares about consistent naming
>>>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>>>> Above says:
>>>>>>>>>>>>>
>>>>>>>>>>>>>          there's no motivation in the systemd/udevd community at
>>>>>>>>>>>>>          this point to refactor the rename logic and make it work well with
>>>>>>>>>>>>>          3-netdev.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>>>
>>>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>>>> solved...
>>>>>>>>> I was just wondering what did you mean when you said
>>>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>>>> was there a proposal udev rejected?
>>>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>>>> the feasibility...
>>>>>>>>
>>>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>>>> into, or add new ones.
>>>>>>>>>
>>>>>>>> See attached diagram.
>>>>>>>>
>>>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>>>> to only work with the master failover device.
>>>>>>>>>> Where does this expectation come from?
>>>>>>>>>>
>>>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>>>
>>>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>>>> migration.
>>>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>>>
>>>>>>>>> BTW this is something we should look at IMHO.
>>>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>>>> the very beginning.
>>>>>>>>
>>>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>>>
>>>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>> Your scripts would not work at all then, right?
>>>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>>>> migrate-able. We would not flag them as live migrate-able until
>>>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>>>> emerges in upstream eventually.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>               net_failover (kernel)               |    network.service (user)    |          systemd-udevd (user)
>>>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>>>> devices created and initialized,                  |                              |
>>>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>>>        net_failover_create()                      |                              |
>>>>>>>> was done.)                                        |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>                                                   | runs `ifup ens3' ->          |
>>>>>>>>                                                   |   ip link set dev ens3 up    |
>>>>>>>> net_failover_open()                               |                              |
>>>>>>>>   dev_open(virtnet_dev)                           |                              |
>>>>>>>>     virtnet_open(virtnet_dev)                     |                              |
>>>>>>>>   netif_carrier_on(failover_dev)                  |                              |
>>>>>>>>   ...                                             |                              |
>>>>>>>>                                                   |                              |
>>>>>>>> (VF hot plugged in)                               |                              |
>>>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>>>  register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>>>   netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>>>    kobject_add(ixgbevf_dev)                       |                              |
>>>>>>>>     device_add(ixgbevf_dev)                       |                              |
>>>>>>>>      kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>>>       netlink_broadcast()                         |                              |
>>>>>>>>   ...                                             |                              |
>>>>>>>>   call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>>>    failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>>>     failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>>>      net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>>>       dev_open(ixgbevf_netdev)                    |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>                                                   |                              |   received ADD uevent from netlink fd
>>>>>>>>                                                   |                              |   ...
>>>>>>>>                                                   |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>>>                                                   |                              |   (decided to rename 'eth0')
>>>>>>>>                                                   |                              |     ip link set dev eth0 name ens4
>>>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>
>>>>>>> Given renaming slaves does not work anyway:
>>>>>> I was actually thinking what if we relieve the rename restriction just for
>>>>>> the failover slave? What the impact would be? I think users don't care about
>>>>>> slave being renamed when it's in use, especially the initial rename.
>>>>>> Thoughts?
>>>>>>
>>>>>>>      would it work if we just
>>>>>>> hard-coded slave names instead?
>>>>>>>
>>>>>>> E.g.
>>>>>>> 1. fail slave renames
>>>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>>>        and primary to XXnpry
>>>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>>>> may not be present.
>>>>> In this scheme if VF is not there it will be renamed immediately after registration.
>>>> Who will be responsible to rename the slave, the kernel?
>>> That's the idea.
>>>
>>>> Note the master's
>>>> name may or may not come from the userspace. If it comes from the userspace,
>>>> should the userspace daemon change their expectation not to name/rename
>>>> _any_ slaves (today there's no distinction)?
>>> Yes the idea would be to fail renaming slaves.
>> No I was asking about the userspace expectation: whether it should track and
>> detect the lifecycle events of failover slaves and decide what to do. How
>> does it get back to the user specified name if VF is not enslaved (say
>> someone unloads the virtio-net module)?
> When virtio net is removed VF will shortly be removed too.
>
>> As this scheme adds much complexity to the kernel naming convention
>> (currently it's just ethX names) that no userspace can understand.
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.
>
>> Will the
>> change break userspace further?
>>
>> -Siwei
> Didn't you show userspace is already broken? You can't "further
> break it"; rename already fails.
It's a race: userspace tends to give the slave a user(space)-desired
name, but sometimes fails due to this race. Today, if the failover
master is not up, the rename succeeds anyway, whereas what you proposed
prohibits the user from providing a name in all circumstances, if I
understand you correctly. That's what I meant by breaking userspace
further. On the other hand, you seem to tie the kernel default naming
to udev predictable names, which come only from recent systemd-udevd,
while there exist many other possible userspace naming schemes. Users
today who deliberately choose to disable predictable naming
(net.ifnames=0 biosdevname=0) and fall back to kernel-provided names
expect the ethX pattern; with this change, admin/user scripts that
match the ethX pattern could potentially break.
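
To make the scheme under discussion concrete, here is a rough userspace
sketch of how I understand the XXnsby/XXnpry derivation. The helper name
and the SKETCH_ constant are made up for illustration; nothing like this
exists in the kernel today, and the suffixes are just the ones quoted
above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: mirrors the kernel's IFNAMSIZ (16 bytes, including the NUL). */
#define SKETCH_IFNAMSIZ 16

/*
 * Hypothetical helper: build "<master><suffix>" into out.
 * Returns 0 on success, or -ENAMETOOLONG if the result would not fit
 * in SKETCH_IFNAMSIZ bytes (so the name is never silently truncated).
 */
static int sketch_slave_name(char out[SKETCH_IFNAMSIZ],
                             const char *master, const char *suffix)
{
    int n;

    assert(out != NULL && master != NULL && suffix != NULL);

    /* snprintf() returns the length the full string would have had. */
    n = snprintf(out, SKETCH_IFNAMSIZ, "%s%s", master, suffix);
    if (n < 0 || n >= SKETCH_IFNAMSIZ)
        return -ENAMETOOLONG;
    return 0;
}
```

With a master named ens3 this yields ens3nsby and ens3npry. A 4-character
suffix leaves at most 11 characters for the master name, so longer master
names would need some other rule.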

IMHO that change is riskier than allowing userspace to change the name
of a failover slave in any case. Let me remind everyone that the target
users of net_failover are specific to the live migration scenario; they
typically don't have the in-depth knowledge to fiddle with the low-level
plumbing and just expect to operate on the master device directly. I
don't have much concern about breaking slave netfilter rules or the like
by just lifting the rename restriction: the failover slave naming itself
is already unreliable, so how can we further break apps that rely on
consistent naming without fixing it in the first place? The fix could be
as simple as a two-line code change. Any net_failover user who might
break due to this change would already have come here and complained
about the naming issue. IOW, at the very least, the change below
shouldn't make the current situation any worse.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1127,7 +1127,8 @@ int dev_change_name(struct net_device *dev, const char *newname)
         BUG_ON(!dev_net(dev));

         net = dev_net(dev);
-       if (dev->flags & IFF_UP)
+       if (dev->flags & IFF_UP &&
+           !(dev->priv_flags & IFF_FAILOVER_SLAVE))
                 return -EBUSY;

         write_seqcount_begin(&devnet_rename_seq);
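
To spell out the intended behaviour of the hunk above, here is a small
userspace model of the check. It is only a sketch: the struct and the
SKETCH_ flag values are stand-ins I made up, not the kernel's struct
net_device or its real flag bits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for IFF_UP and IFF_FAILOVER_SLAVE; values are illustrative. */
#define SKETCH_IFF_UP             (1u << 0)
#define SKETCH_IFF_FAILOVER_SLAVE (1u << 1)

/* Minimal model of the two fields the check looks at. */
struct sketch_dev {
    unsigned int flags;      /* models dev->flags */
    unsigned int priv_flags; /* models dev->priv_flags */
};

/*
 * Models the proposed condition: a rename is refused with -EBUSY only
 * when the device is up AND it is not a failover slave.
 */
static int sketch_rename_check(const struct sketch_dev *dev)
{
    assert(dev != NULL);

    if ((dev->flags & SKETCH_IFF_UP) &&
        !(dev->priv_flags & SKETCH_IFF_FAILOVER_SLAVE))
        return -EBUSY;
    return 0;
}
```

So an ordinary device that is up still gets -EBUSY, while a failover
slave can be renamed whether or not it is up, and a device that is down
can be renamed as before.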

Thanks,
-Siwei


>
>>>> How do users know which name to
>>>> trust, depending on which wins the race more often? Say if kernel wants a
>>>> ens3npry name while userspace wants it named as ens4.
>>>>
>>>> -Siwei
>>> With this approach kernel will deny attempts by userspace to rename
>>> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
>>> will rename both slaves.
>>>
>>> It seems pretty solid to me, the only issue is that in theory userspace
>>> can use a name like XXXnsby for something else. But this seems unlikely.
>>>
>>>
>>>>>> I don't like the idea to delay exposing failover master
>>>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> I agree, this was not what I meant.
>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org