From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 3CB2CC433F5
 for ; Fri, 28 Jan 2022 02:57:03 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S234966AbiA1C5C (ORCPT ); Thu, 27 Jan 2022 21:57:02 -0500
Received: from mga17.intel.com ([192.55.52.151]:38587 "EHLO mga17.intel.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S229811AbiA1C5B (ORCPT );
 Thu, 27 Jan 2022 21:57:01 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com;
 q=dns/txt; s=Intel; t=1643338621; x=1674874621;
 h=cc:subject:to:references:from:message-id:date:
 mime-version:in-reply-to:content-transfer-encoding;
 bh=lQzpVBgtczAn5sAewXENYJpIAUiWWZKcbyftWJFiTnk=;
 b=ehMl2PURoXyAUIkUlgHTDB9BfqLsE4sspvkPQVACKG3fMoT1qSWUKllw
 2CjWsx9+Rpb/sGPje+6nsSQOjezlC7QUoVik6ZsK9Nva+NxDPFoB8Liyj
 ZFtJsuBfCIbmv8OZ1AaLXYS+MwUKNBXxfxvA0fcQ3PaWbPYxMbv8mX3hz
 qf5Alo7q3jUg415JJtnPD6g6weBww45mXJwG/kGFXNYETs2BoJGiukWrJ
 Tv3GAK5BrdXHbwfJjZcHkGNAceX3ztWz1KdFa7dnTD+fw1zS59zBeNvi9
 UNB+x8HPkTCXo3Ddd5F5ByVfqnEgBm2FkLqcpt+UdA8zETFZedAREtipE
 A==;
X-IronPort-AV: E=McAfee;i="6200,9189,10240"; a="227699932"
X-IronPort-AV: E=Sophos;i="5.88,322,1635231600"; d="scan'208";a="227699932"
Received: from orsmga008.jf.intel.com ([10.7.209.65])
 by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 27 Jan 2022 18:54:14 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.88,322,1635231600"; d="scan'208";a="535953637"
Received: from allen-box.sh.intel.com (HELO [10.239.159.118]) ([10.239.159.118])
 by orsmga008.jf.intel.com with ESMTP; 27 Jan 2022 18:54:10 -0800
Cc: baolu.lu@linux.intel.com, bhelgaas@google.com,
 mika.westerberg@linux.intel.com, koba.ko@canonical.com,
 Russell Currey , Oliver O'Halloran , Lalithambika Krishnakumar ,
 Joerg Roedel , linuxppc-dev@lists.ozlabs.org,
 linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state
To: Kai-Heng Feng
References: <20220127025418.1989642-1-kai.heng.feng@canonical.com>
 <0259955f-8bbb-1778-f234-398f1356db8b@linux.intel.com>
From: Lu Baolu
Message-ID: <11891652-40c6-f111-46b7-e96d1729815e@linux.intel.com>
Date: Fri, 28 Jan 2022 10:53:07 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.14.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/27/22 7:14 PM, Kai-Heng Feng wrote:
> On Thu, Jan 27, 2022 at 3:01 PM Lu Baolu wrote:
>>
>> On 2022/1/27 10:54, Kai-Heng Feng wrote:
>>> Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
>>> hint") enables ACS, and some platforms lose its NVMe after resume from
>>> S3:
>>> [ 50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
>>> [ 50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
>>> [ 50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
>>> [ 50.947830] pcieport 0000:00:1b.0: device [8086:06ac] error status/mask=00200000/00010000
>>> [ 50.947831] pcieport 0000:00:1b.0: [21] ACSViol (First)
>>> [ 50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
>>> [ 50.947843] nvme nvme0: frozen state error detected, reset controller
>>>
>>> It happens right after ACS gets enabled during resume.
>>>
>>> There's another case, when Thunderbolt reaches D3cold:
>>> [ 30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
>>> [ 30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
>>> [ 30.100256] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
>>> [ 30.100262] pcieport 0000:00:1d.0: [20] UnsupReq (First)
>>> [ 30.100267] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 08000052 00000000 00000000
>>> [ 30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
>>> [ 30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
>>> [ 30.100427] pcieport 0000:00:1d.0: AER: device recovery failed
>>>
>>> So disable AER service to avoid the noises from turning power rails
>>> on/off when the device is in low power states (D3hot and D3cold), as
>>> PCIe spec "5.2 Link State Power Management" states that TLP and DLLP
>>> transmission is disabled for a Link in L2/L3 Ready (D3hot), L2 (D3cold
>>> with aux power) and L3 (D3cold).
>>>
>>> Bugzilla:https://bugzilla.kernel.org/show_bug.cgi?id=209149
>>> Bugzilla:https://bugzilla.kernel.org/show_bug.cgi?id=215453
>>> Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
>>
>> I don't know what this fix has to do with the commit 50310600ebda.
>
> Commit 50310600ebda only exposed the underlying issue. Do you think
> "Fixes:" tag should change to other commits?
>
>> Commit 50310600ebda only makes sure that PCI ACS is enabled whenever
>> Intel IOMMU is on. Before this commit, PCI ACS could also be enabled
>> and result in the same problem. Or anything I missed?
>
> The system in question didn't enable ACS before commit 50310600ebda.

Just because this commit exposed the issue on your configuration doesn't
mean the fix should be backported as far as that commit.
I believe that if you add intel_iommu=on to the kernel command line, the
issue still exists even if you revert commit 50310600ebda or check out a
tag before it.

Best regards,
baolu
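
For readers following the quoted logs: the "error status/mask" values are bitmasks over the AER Uncorrectable Error Status register, and the bracketed number on the next log line is the bit that was set. Below is a minimal sketch of that decoding. The function name and the table are illustrative only and are not kernel code; bits 20 and 21 are taken from the logs in this thread, while bits 14 and 16 are filled in from the short names the Linux AER driver prints, which is an assumption on my part.

```python
# Illustrative decoder for the "error status/mask=..." values in the logs.
# Bits 20 (UnsupReq) and 21 (ACSViol) appear in the quoted logs; bits 14 and
# 16 are assumed from the names the Linux AER driver uses when printing.
AER_UNCORRECTABLE_NAMES = {
    14: "CmpltTO",   # Completion Timeout (assumed name)
    16: "UnxCmplt",  # Unexpected Completion (assumed name)
    20: "UnsupReq",  # Unsupported Request
    21: "ACSViol",   # ACS Violation
}


def decode_uncorrectable(value):
    """Return (bit, name) pairs for every bit set in a 32-bit value."""
    decoded = []
    for bit in range(32):
        if value & (1 << bit):
            name = AER_UNCORRECTABLE_NAMES.get(bit, "bit%d" % bit)
            decoded.append((bit, name))
    return decoded


if __name__ == "__main__":
    # Status/mask pairs copied from the two log excerpts in this thread.
    for status, mask in ((0x00200000, 0x00010000), (0x00100000, 0x00004000)):
        print("status:", decode_uncorrectable(status),
              "mask:", decode_uncorrectable(mask))
```

In both excerpts the reported status bit is not covered by the mask, which is consistent with the error being reported rather than suppressed.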