From: Liran Alon
Date: Thu, 28 Feb 2019 02:00:48 +0200
Subject: Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
To: "Michael S. Tsirkin"
Cc: si-wei liu, "Samudrala, Sridhar", Siwei Liu, Jiri Pirko, Stephen Hemminger, David Miller, Netdev, virtualization@lists.linux-foundation.org, virtio-dev, "Brandeburg, Jesse", Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20190227184601-mutt-send-email-mst@kernel.org>
X-Mailing-List: netdev@vger.kernel.org

> On 28 Feb 2019, at 1:50, Michael S. Tsirkin wrote:
>
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>
>>
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>>
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread.
>>>>>>>>>>>> There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>
>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which doesn't provide a solution if users care about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>>
>>>>>>>>>>> there's no motivation in the systemd/udevd community at
>>>>>>>>>>> this point to refactor the rename logic and make it work well with
>>>>>>>>>>> 3-netdev.
>>>>>>>>>>>
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>>
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure? That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>>
>>>>>> See attached diagram.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>>
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>>
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel.
>>>>>>>>> If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>> net_failover(kernel)                          | network.service (user)   | systemd-udevd (user)
>>>>>> ----------------------------------------------+--------------------------+--------------------------------------
>>>>>> (standby virtio-net and net_failover          |                          |
>>>>>> devices created and initialized,              |                          |
>>>>>> i.e. virtnet_probe()->                        |                          |
>>>>>> net_failover_create()                         |                          |
>>>>>> was done.)                                    |                          |
>>>>>>                                               |                          |
>>>>>>                                               | runs `ifup ens3' ->      |
>>>>>>                                               | ip link set dev ens3 up  |
>>>>>> net_failover_open()                           |                          |
>>>>>> dev_open(virtnet_dev)                         |                          |
>>>>>> virtnet_open(virtnet_dev)                     |                          |
>>>>>> netif_carrier_on(failover_dev)                |                          |
>>>>>> ...                                           |                          |
>>>>>>                                               |                          |
>>>>>> (VF hot plugged in)                           |                          |
>>>>>> ixgbevf_probe()                               |                          |
>>>>>> register_netdev(ixgbevf_netdev)               |                          |
>>>>>> netdev_register_kobject(ixgbevf_netdev)       |                          |
>>>>>> kobject_add(ixgbevf_dev)                      |                          |
>>>>>> device_add(ixgbevf_dev)                       |                          |
>>>>>> kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD)  |                          |
>>>>>> netlink_broadcast()                           |                          |
>>>>>> ...                                           |                          |
>>>>>> call_netdevice_notifiers(NETDEV_REGISTER)     |                          |
>>>>>> failover_event(..., NETDEV_REGISTER, ...)     |                          |
>>>>>> failover_slave_register(ixgbevf_netdev)       |                          |
>>>>>> net_failover_slave_register(ixgbevf_netdev)   |                          |
>>>>>> dev_open(ixgbevf_netdev)                      |                          |
>>>>>>                                               |                          |
>>>>>>                                               |                          | received ADD uevent from netlink fd
>>>>>>                                               |                          | ...
>>>>>>                                               |                          | udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                               |                          | (decided to rename 'eth0')
>>>>>>                                               |                          | ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as          |                          |
>>>>>> ixgbevf_netdev->flags has IFF_UP)             |                          |
>>>>>>                                               |                          |
>>>>>>
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>>
>>>>> would it work if we just
>>>>> hard-coded slave names instead?
>>>>>
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>> and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
>
> That's the idea.
>
>> Note the master's
>> name may or may not come from the userspace.
>> If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
>
> Yes the idea would be to fail renaming slaves.
>
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>>
>> -Siwei
>
> With this approach kernel will deny attempts by userspace to rename
> slaves. Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
>
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.

I’m fond of this idea and I have a similar opinion.
I think it simplifies the issue here.
I don’t see a real reason for a customer to define a udev rule to rename a net-failover slave to have a different postfix.

-Liran

>
>
>>>
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>
>>> I agree, this was not what I meant.
>>>
>>>>>
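
For illustration only, here is a rough userspace model of the naming rule proposed in the quoted text: slave names are derived from the master name, a master rename carries over to both slaves, a VF that is hot plugged later gets its derived name right after registration, and rename requests aimed at a slave are refused. This is not kernel code and not taken from any patch. The class `Failover`, the helper `slave_names()` and the use of `PermissionError` are hypothetical names chosen for the sketch; only the `nsby`/`npry` suffixes come from the thread, and `IFNAMSIZ = 16` is the usual Linux interface-name limit.

```python
IFNAMSIZ = 16  # Linux interface-name limit, including the trailing NUL

STANDBY_SUFFIX = "nsby"  # suffixes as proposed in the thread
PRIMARY_SUFFIX = "npry"


def slave_names(master: str) -> dict:
    """Derive the fixed slave names from the failover master's name."""
    names = {
        "standby": master + STANDBY_SUFFIX,
        "primary": master + PRIMARY_SUFFIX,
    }
    for name in names.values():
        if len(name) >= IFNAMSIZ:
            raise ValueError(f"{name!r} does not fit in IFNAMSIZ")
    return names


class Failover:
    """Toy model (hypothetical): one master plus standby/primary slaves."""

    def __init__(self, master: str):
        self.master = master
        # The standby virtio-net exists from the start; the VF (primary)
        # may be hot plugged later.
        self.present = {"standby"}

    def names(self) -> dict:
        """Current names of the slaves that are actually present."""
        derived = slave_names(self.master)
        return {role: derived[role] for role in self.present}

    def hotplug_primary(self) -> str:
        """VF arrives: it is named right after registration."""
        self.present.add("primary")
        return slave_names(self.master)["primary"]

    def rename_master(self, new_name: str) -> None:
        """Renaming the master implicitly renames every slave."""
        slave_names(new_name)  # validate the derived names first
        self.master = new_name

    def rename_slave(self, role: str, new_name: str) -> None:
        """Userspace attempts to rename a slave are always refused."""
        raise PermissionError(f"renaming the {role} slave is not allowed")


if __name__ == "__main__":
    fo = Failover("eth0")
    fo.rename_master("ens3")     # e.g. udev renames the master
    print(fo.names())            # standby follows the master name
    print(fo.hotplug_primary())  # VF shows up after the rename
```

One thing the sketch makes visible: with a 4-character suffix, a master name longer than 11 characters would leave no room for the derived slave names within IFNAMSIZ, so a real implementation would need some policy for that case.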