From: "Alex G." <mr.nuke.me@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
	tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
	shiju.jose@huawei.com, zjzhang@codeaurora.org,
	gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
	alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
	robert.moore@intel.com, erik.schmauss@intel.com
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Thu, 19 Apr 2018 11:26:57 -0500
Message-ID: <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
In-Reply-To: <20180419154006.GE3600@pd.tnic>


On 04/19/2018 10:40 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 09:57:07AM -0500, Alex G. wrote:
>> ghes_severity() is a one-to-one mapping from a set of unsorted
>> severities to monotonically increasing numbers. The "one-to-one" mapping
>> part of the sentence is obvious from the function name. To change it to
>> parse the entire GHES would completely destroy this, and I think it
>> would apply policy in the wrong place.
> 
> So do a wrapper or whatever. Do a ghes_compute_severity() or however you
> would wanna call it and do the iteration there.

That doesn't sound right. There isn't a formula to compute. What we're
doing is we're looking at individual error sources, and deciding what
errors we can handle based both on the error, and our ability to handle
the error.
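
To illustrate the distinction, here is a minimal sketch, not the code from the patch. Every `sketch_` name is hypothetical, and the enum values are only assumed to mirror the unsorted firmware-side ordering and the sorted kernel-side ordering discussed above. The first function is the plain one-to-one mapping; the second is where a policy decision (here, an assumed "the AER handler can recover from this" rule) would creep in.

```c
/* Firmware-reported severities: the numeric order is NOT sorted by badness. */
enum sketch_cper_sev {
	SKETCH_CPER_RECOVERABLE   = 0,
	SKETCH_CPER_FATAL         = 1,
	SKETCH_CPER_CORRECTED     = 2,
	SKETCH_CPER_INFORMATIONAL = 3,
};

/* Kernel-side severities: monotonically increasing with badness. */
enum sketch_ghes_sev {
	SKETCH_GHES_NO          = 0,
	SKETCH_GHES_CORRECTED   = 1,
	SKETCH_GHES_RECOVERABLE = 2,
	SKETCH_GHES_PANIC       = 3,
};

/* Hypothetical error-section kinds, standing in for "which handler gets it". */
enum sketch_section {
	SKETCH_SEC_PCIE_AER,
	SKETCH_SEC_OTHER,
};

/* Pure mapping: one input severity, one output severity, no policy. */
int sketch_severity(int cper_sev)
{
	switch (cper_sev) {
	case SKETCH_CPER_INFORMATIONAL:
		return SKETCH_GHES_NO;
	case SKETCH_CPER_CORRECTED:
		return SKETCH_GHES_CORRECTED;
	case SKETCH_CPER_RECOVERABLE:
		return SKETCH_GHES_RECOVERABLE;
	case SKETCH_CPER_FATAL:
	default:
		/* Unknown values are treated as the worst case. */
		return SKETCH_GHES_PANIC;
	}
}

/*
 * Policy: the result depends on the error AND on what the handler for
 * that section is able to do. The downgrade rule below is an assumption
 * made for illustration only.
 */
int sketch_effective_severity(int cper_sev, enum sketch_section sec)
{
	int sev = sketch_severity(cper_sev);

	if (sev == SKETCH_GHES_PANIC && sec == SKETCH_SEC_PCIE_AER)
		return SKETCH_GHES_RECOVERABLE;

	return sev;
}
```

The second function is no longer a mapping of severities; it encodes what a specific handler can cope with, which is the coupling I'm referring to.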

>> Should I do that, I might have to call it something like
>> ghes_parse_and_apply_policy_to_severity(). But that misses the whole
>> point of these changes.
> 
> What policy? You simply compute the severity like we do in the mce code.

As explained above, our ability to resolve an error depends on the
interaction between the error and error handler. This is very closely
tied to the capabilities of each individual handler. I'll do it your
way, but I don't think ignoring this tight coupling is the right thing
to do.

> 
>> I would like to get to the handlers first, and then decide if things are
>> okay or not,
> 
> Why? Give me an example why you'd handle an error first and then decide
> whether we're ok or not?
> 
> Usually, the error handler decides that in one place. So what exactly
> are you trying to do differently that doesn't fit that flow?

In the NMI case you don't make it to the error handler. James and I beat
this subject to the afterlife in v1.

>> I don't want to leave people scratching their heads, but I also don't
>> want to make AER a special case without having a generic way to handle
>> these cases. People are just as likely to scratch their heads
>> wondering why AER is a special case and everything else crashes.
> 
> Not if it is properly done *and* documented why we're applying the
> respective policy for the error type.
> 
>> Maybe it's better to move the AER handling to NMI/IRQ context, since
>> ghes_handle_aer() is only scheduling the real AER handler, and is irq
>> safe. I'm scratching my head about why we're messing with IRQ work from
>> NMI context, instead of just scheduling a regular handler to take care
>> of things.
> 
> No, first pls explain what exactly you're trying to do

I realize v1 was quite a while back, so I'll take this opportunity to
restate:

At a very high level, I'm working with Dell on improving server
reliability, with a focus on NVME hotplug and surprise removal. One of
the features we don't support is surprise removal of NVME drives;
hotplug is supported with 'prepare to remove'. This is one of the
reasons NVME is not on feature parity with SAS and SATA.

My role is to solve this issue on Linux, and to not worry about other
OSes. This puts me in a position to have a Linux-centric view of the
problem, as opposed to the more common firmware-centric view.

Part of solving the surprise removal issue involves improving FFS error
handling. This is required because the servers which are shipping use
FFS instead of native error notifications. As part of extensive testing,
I have found the NMI handler to be the most common cause of crashes, and
hence this series.

> and then we can talk about how to do it.

Your move.

> Btw, a real-life example to accompany that intention goes a long way.

I'm not sure if this is the example you're looking for, but
take an r740xd server, and slowly unplug an Intel NVME drive at an
angle. You're likely to crash the machine.

Alex


