From: Prabhakar Kushwaha
Date: Mon, 19 Jul 2021 20:20:54 +0530
Subject: Re: [PATCH] qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union()
To: Justin He
Cc: Ariel Elior, GR-everest-linux-l2@marvell.com, David S. Miller, Jakub Kicinski, netdev@vger.kernel.org, Linux Kernel Mailing List, nd, Shai Malin, Shai Malin, Prabhakar Kushwaha
References: <20210715080822.14575-1-justin.he@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Justin,

On Mon, Jul 19, 2021 at 6:47 PM Justin He wrote:
>
> Hi Prabhakar
>
> > -----Original Message-----
> > From: Prabhakar Kushwaha
> > Sent: Monday, July 19, 2021 6:36 PM
> > To: Justin He
> > Cc: Ariel Elior; GR-everest-linux-l2@marvell.com;
> > David S. Miller; Jakub Kicinski;
> > netdev@vger.kernel.org; Linux Kernel Mailing List; nd; Shai Malin;
> > Shai Malin; Prabhakar Kushwaha
> > Subject: Re: [PATCH] qed: fix possible unpaired spin_{un}lock_bh in
> > _qed_mcp_cmd_and_union()
> >
> > Hi Jia,
> >
> > On Thu, Jul 15, 2021 at 2:28 PM Jia He wrote:
> > >
> > > Lijian reported a bug_on hit on a ThunderX2 arm64 server with FastLinQ
> > > QL41000 ethernet controller:
> > >  BUG: scheduling while atomic: kworker/0:4/531/0x00000200
> > >   [qed_probe:488()]hw prepare failed
> > >   kernel BUG at mm/vmalloc.c:2355!
> > >   Internal error: Oops - BUG: 0 [#1] SMP
> > >   CPU: 0 PID: 531 Comm: kworker/0:4 Tainted: G W 5.4.0-77-generic #86-Ubuntu
> > >   pstate: 00400009 (nzcv daif +PAN -UAO)
> > >  Call trace:
> > >   vunmap+0x4c/0x50
> > >   iounmap+0x48/0x58
> > >   qed_free_pci+0x60/0x80 [qed]
> > >   qed_probe+0x35c/0x688 [qed]
> > >   __qede_probe+0x88/0x5c8 [qede]
> > >   qede_probe+0x60/0xe0 [qede]
> > >   local_pci_probe+0x48/0xa0
> > >   work_for_cpu_fn+0x24/0x38
> > >   process_one_work+0x1d0/0x468
> > >   worker_thread+0x238/0x4e0
> > >   kthread+0xf0/0x118
> > >   ret_from_fork+0x10/0x18
> > >
> > > In this case, qed_hw_prepare() returns error due to hw/fw error, but in
> > > theory work queue should be in process context instead of interrupt.
> > >
> > > The root cause might be the unpaired spin_{un}lock_bh() in
> > > _qed_mcp_cmd_and_union(), which causes the bottom half to be disabled
> > > incorrectly.
> > >
> > > Reported-by: Lijian Zhang
> > > Signed-off-by: Jia He
> > > ---
> >
> > This patch is adding additional spin_{un}lock_bh().
> > Can you please enlighten about the exact flow causing this unpaired
> > spin_{un}lock_bh.
> >
> For instance:
> _qed_mcp_cmd_and_union()
>   In while loop
>     spin_lock_bh()
>     qed_mcp_has_pending_cmd() (assume false), will break the loop

I agree up to here.

>   if (cnt >= max_retries) {
> ...
>     return -EAGAIN; <-- here returns -EAGAIN without invoking bh unlock
>   }
>

Because of the break, cnt has not been incremented, so:
- cnt is still less than max_retries;
- if (cnt >= max_retries) will not be *true*, so execution goes on to
  spin_unlock_bh().
Hence the lock/unlock pairing is complete. I do not see any issue here.

> > Also,
> > as per description, looks like you are not sure actual the root-cause.
> > does this patch really solved the problem?
>
> I don't have that ThunderX2 to verify the patch.
> But I searched all the spin_lock/unlock_bh and spin_lock_irqsave/irqrestore
> under driver/.../qlogic, this is the only problematic point I could figure
> out. And this might be possible code path of qed_probe().
>

Without testing and a proper root cause, it is tough to accept the suggested fix.

--pk
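
The pairing argument above can be checked with a small user-space model. This is only a sketch under assumptions, not the driver code: the do/while shape with the counter incremented in the loop condition is inferred from the flow described in this thread, and every name here (struct model, model_lock(), model_unlock(), model_has_pending_cmd(), model_cmd()) is hypothetical. The real function has further exit paths that are not modeled. A plain counter stands in for the bottom-half disable depth that spin_lock_bh()/spin_unlock_bh() would change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the driver. */
struct model {
    int bh_depth;   /* +1 per modeled spin_lock_bh(), -1 per modeled unlock */
    int busy_polls; /* number of polls during which a command is pending */
};

static void model_lock(struct model *m)   { m->bh_depth++; }
static void model_unlock(struct model *m) { m->bh_depth--; }

/* Reports "pending" for the first busy_polls calls, then "not pending". */
static bool model_has_pending_cmd(struct model *m)
{
    if (m->busy_polls > 0) {
        m->busy_polls--;
        return true;
    }
    return false;
}

/* Returns 0 on success, -1 on timeout (standing in for -EAGAIN). */
static int model_cmd(struct model *m, int max_retries)
{
    int cnt = 0;

    do {
        model_lock(m);
        if (!model_has_pending_cmd(m))
            break;          /* lock stays held; ++cnt is skipped */
        model_unlock(m);    /* released before the next retry */
    } while (++cnt < max_retries);

    /* Reached with cnt >= max_retries only when the loop ran out,
     * and in that case the last iteration already dropped the lock. */
    if (cnt >= max_retries)
        return -1;

    /* The command would be sent here, still under the lock. */
    model_unlock(m);
    return 0;
}
```

In this model the counter is back at zero on both return paths: a break leaves cnt below max_retries, so the early return is skipped and the final unlock runs, while a timeout is only reached after the last iteration has already unlocked.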