From: Jay Vosburgh
To: Aleksei Zakharov
cc: netdev@vger.kernel.org, "zhangsha (A)"
Subject: Re: Fwd: [PATCH] bonding/802.3ad: fix slave initialization states race
Date: Wed, 25 Sep 2019 21:38:54 -0700
Message-ID: <15507.1569472734@nyx>

Aleksei Zakharov wrote:

>On Wed, 25 Sep 2019 at 03:31, Jay Vosburgh wrote:
>>
>> Алексей Захаров wrote:
>> [...]
>> >Right after reboot one of the slaves hangs with actor port state 71
>> >and partner port state 1.
>> >It doesn't send lacpdu and seems to be broken.
>> >Setting link down and up again fixes slave state.
>> [...]
>>
>> I think I see what failed in the first patch, could you test the
>> following patch? This one is for net-next, so you'd need to again swap
>> slave_err / netdev_err for the Ubuntu 4.15 kernel.
>>
>I've tested new patch. It seems to work. I can't reproduce the bug
>with this patch.
>There are two types of messages when link becomes up:
>First:
>bond-san: EVENT 1 llu 4294895911 slave eth2
>8021q: adding VLAN 0 to HW filter on device eth2
>bond-san: link status definitely down for interface eth2, disabling it
>mlx4_en: eth2: Link Up
>bond-san: EVENT 4 llu 4294895911 slave eth2
>bond-san: link status up for interface eth2, enabling it in 500 ms
>bond-san: invalid new link 3 on slave eth2
>bond-san: link status definitely up for interface eth2, 10000 Mbps full duplex
>Second:
>bond-san: EVENT 1 llu 4295147594 slave eth2
>8021q: adding VLAN 0 to HW filter on device eth2
>mlx4_en: eth2: Link Up
>bond-san: EVENT 4 llu 4295147594 slave eth2
>bond-san: link status up again after 0 ms for interface eth2
>bond-san: link status definitely up for interface eth2, 10000 Mbps full duplex
>
>These messages (especially "invalid new link") look a bit unclear from
>sysadmin point of view.

	The "invalid new link" message appears because bond_miimon_commit
is being asked to commit a new state that isn't UP or DOWN (3 is
BOND_LINK_BACK). I looked through the patched code today, and I don't
see a way to reach that message with the new link state set to 3, so
I'll add some instrumentation and send out another patch to figure out
what's going on, as that shouldn't happen.

	I don't see the "invalid" message when testing locally, I think
because my network device doesn't transition to carrier up as quickly
as yours. I thought you were getting BOND_LINK_BACK passed through from
bond_enslave (which calls bond_set_slave_link_state, which will set
link_new_state to BOND_LINK_BACK and leave it there), but
link_new_state is reset first thing in bond_miimon_inspect, so I'm not
sure how it gets into bond_miimon_commit (I'm thinking perhaps a
concurrent commit triggered by another slave, which then picks up this
proposed link state change by happenstance).

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
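
For anyone following the thread, here is a rough userspace sketch of why a proposed state of 3 produces that log line. It is not the bonding driver's code: the struct and function names (toy_slave, toy_commit) are hypothetical stand-ins, and only the BOND_LINK_* values are taken from the kernel's if_bonding.h. It assumes the commit step accepts only UP or DOWN and rejects anything else.

```c
#include <stdio.h>

/* Link-state values as defined in include/uapi/linux/if_bonding.h. */
enum {
	BOND_LINK_UP   = 0,	/* link is up and running */
	BOND_LINK_FAIL = 1,	/* link has just gone down */
	BOND_LINK_DOWN = 2,	/* link has been down for too long */
	BOND_LINK_BACK = 3	/* link is going back up */
};
#define BOND_LINK_NOCHANGE (-1)

/* Hypothetical, heavily reduced stand-in for the driver's per-slave state. */
struct toy_slave {
	const char *name;
	int link;		/* committed link state */
	int link_new_state;	/* proposed link state awaiting commit */
};

/*
 * Hypothetical commit step: only UP and DOWN may be committed.
 * Returns 0 on success (or nothing to do), -1 if the proposal was rejected.
 * The proposal is cleared in every case.
 */
static int toy_commit(struct toy_slave *s)
{
	int ret = 0;

	switch (s->link_new_state) {
	case BOND_LINK_NOCHANGE:
		return 0;
	case BOND_LINK_UP:
	case BOND_LINK_DOWN:
		s->link = s->link_new_state;
		break;
	default:
		/* FAIL (1) and BACK (3) are transitional, so they land here. */
		fprintf(stderr, "invalid new link %d on slave %s\n",
			s->link_new_state, s->name);
		ret = -1;
		break;
	}

	s->link_new_state = BOND_LINK_NOCHANGE;
	return ret;
}
```

In this sketch, a slave left with a proposed state of BOND_LINK_BACK is rejected at commit time and keeps its previous committed state, which is the shape of the "invalid new link 3" line in the first log above. How the real driver arrives at commit with that value is the open question the instrumentation is meant to answer.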