linux-kernel.vger.kernel.org archive mirror
From: "Alex G." <mr.nuke.me@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>, Tony Luck <tony.luck@intel.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Robert Moore <robert.moore@intel.com>,
	Erik Schmauss <erik.schmauss@intel.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Will Deacon <will.deacon@arm.com>,
	James Morse <james.morse@arm.com>,
	Shiju Jose <shiju.jose@huawei.com>,
	"Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>,
	Dongjiu Geng <gengdongjiu@huawei.com>,
	linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-edac@vger.kernel.org, devel@acpica.org
Subject: Re: [RFC PATCH v4 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
Date: Fri, 11 May 2018 12:01:52 -0500
Message-ID: <95bcbc2d-0f8c-e51a-f0fc-08ea8c5fca26@gmail.com>
In-Reply-To: <20180511162951.GH12705@pd.tnic>

On 05/11/2018 11:29 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
>>> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
>>> enough of a check to confirm that there actually *is* an AER driver to
>>> handle the errors. If you really want to make sure the driver is loaded
>>> and functioning, then you need an explicit registering mechanism or some
>>> other way of checking it really is there and handling errors.
>>
>> config ACPI_APEI_PCIEAER
>> 	bool "APEI PCIe AER logging/recovering support"
>> 	depends on ACPI_APEI && PCIEAER
>> 	help
>> 	  PCIe AER errors may be reported via APEI firmware first mode.
>> 	  Turn on this option to enable the corresponding support.
>>
>> PCIAER is not modularizable. QED
> 
> QED my ass.
> 
> Read the f*ck my email again: the presence of the *code* is
> not enough of a check to confirm the error has been handled.
> aer_recover_work_func() can fail as that kfifo_put() in
> aer_recover_queue() can too.
> 
> You need an *actual* confirmation that the error has been handled
> properly and *only* *then* not panic the system. Otherwise you are
> potentially leaving those errors unhandled.


"How is PCIe error severity dependent on whether the AER error reporting
 driver is enabled (and possibly not even loaded) on the system?"

Confirmation that the error was handled was barely discussed, either in
your **** email or in previous versions of this series.  And quite
frankly, it's beyond the scope of this patch.

The scope is to enable SURPRISE!!! removal of NVMe drives and PCIe
devices. For that purpose, we don't need confirmation that the error was
handled. Such a confirmation requires a rework of GHES handling, or at
least the interaction between GHES and AER, both of which I find to be
mostly satisfactory.

You can't at this point know if the error is going to be handled.
There's code further downstream to handle this. You also didn't like it
when I wanted to handle things downstream.
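
To make the disagreement concrete, here is a minimal sketch in plain
userspace C of what "panic only if the error could not be queued" might
look like. It is an illustration under stated assumptions, not the patch
and not kernel code: sketch_fifo, sketch_fifo_put() and
sketch_handle_fatal_pcie() are hypothetical names. The only kernel
behavior it mirrors is that kfifo_put() returns 0 when the FIFO is full.
It also covers only the enqueue step; it says nothing about whether the
recovery worker later succeeds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the AER recovery ring; NOT the kernel's kfifo. */
#define SKETCH_FIFO_SIZE 4

struct sketch_fifo {
	int entries[SKETCH_FIFO_SIZE];
	size_t count;
};

/*
 * Queue one error record. Returns false when the ring is full, which is
 * the failure mode analogous to kfifo_put() returning 0.
 */
static bool sketch_fifo_put(struct sketch_fifo *f, int err_id)
{
	if (f->count >= SKETCH_FIFO_SIZE)
		return false;
	f->entries[f->count++] = err_id;
	return true;
}

enum sketch_action { SKETCH_CONTINUE, SKETCH_PANIC };

/*
 * Hypothetical policy: a fatal PCIe error avoids panic only if it was
 * actually handed off to the recovery queue. Successful queuing is still
 * not proof that recovery will succeed.
 */
static enum sketch_action sketch_handle_fatal_pcie(struct sketch_fifo *f,
						   int err_id)
{
	return sketch_fifo_put(f, err_id) ? SKETCH_CONTINUE : SKETCH_PANIC;
}
```

With a ring of four entries, the first four errors are queued and the
fifth would fall back to panic. Whether the real GHES path should behave
this way is exactly what is being argued here.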

I understand your concern with unhandled AER errors evolving into MCEs.
That's extremely rare, but when it happens you still panic due to the
MCE. To give you an idea of the rarity, in several months of testing, I
was only able to reproduce MCEs once, and that was with a very defective
drive, and a very idiotic test case.

If you find this solution unacceptable, that's fine. We can fix it in
firmware. We can hide all the events from the OS, contain the downstream
ports, and simulate hot-remove interrupts. All in firmware, all the time.

Alex
