linux-acpi.vger.kernel.org archive mirror
From: Tyler Baicar OS <baicar@os.amperecomputing.com>
To: James Morse <james.morse@arm.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Tony Luck <tony.luck@intel.com>, Xie XiuQi <xiexiuqi@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Rafael Wysocki <rjw@rjwysocki.net>,
	Tyler Baicar <tyler@amperecomputing.com>,
	Borislav Petkov <bp@alien8.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Will Deacon <will@kernel.org>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Len Brown <lenb@kernel.org>
Subject: Re: [PATCH 0/3] ACPI / APEI: Kick the memory_failure() queue for synchronous errors
Date: Fri, 20 Mar 2020 19:19:27 +0000
Message-ID: <BN3PR01MB21303880EF8D2953854401D5E3F50@BN3PR01MB2130.prod.exchangelabs.com>
In-Reply-To: <20200228174817.74278-1-james.morse@arm.com>

Hello James,

I think my one comment on patch 2 is valid, right? But for this series:

Tested-by: Tyler Baicar <baicar@os.amperecomputing.com>

Thanks,
Tyler

On Fri, Feb 28, 2020 at 12:48 PM James Morse <james.morse@arm.com> wrote:
>
> Hello!
>
> These are the remaining patches from the SDEI series[0] that fix
> a race between memory_failure() and user-space re-triggering the error
> in ghes.c.
>
>
> ghes_handle_memory_failure() calls memory_failure_queue() from
> IRQ context to schedule memory_failure()'s work, as it needs to sleep.
> Once the GHES machinery returns from the IRQ, it may return to user-space
> before memory_failure() runs.
>
> If the error that kicked all this off is specific to user-space, e.g. a
> load from corrupted memory, we may find ourselves taking the error
> again. If the user-space task is scheduled out, and memory_failure() runs,
> the same user-space task may be scheduled in on another CPU, which could
> also take the same error.
>
> Either case leads to exaggerated error counters, which may cause some
> threshold to be reached early.
>
> This can happen with any error that causes a Synchronous External Abort
> on arm64. I can't see why the same wouldn't happen with a machine-check
> handled firmware first on x86.
>
>
> This series adds a memory_failure_queue_kick() helper to
> memory-failure.c, and calls it as task-work before returning to
> user-space.
>
>
> Currently arm64 papers over this problem by ignoring ghes_notify_sea()'s
> return code as it knows there is still work to do. arm64 generates its
> own signal to user-space, which means the first task to discover an
> error will always be killed, even if the error was later handled
> (which is no improvement on the no-RAS behaviour).
>
> As a final piece, arm64 can try to process the irq work queued by
> ghes_notify_sea() while it's still in the external abort handler. A successful
> return value here now means the memory_failure() work will be done before we
> return to user-space, so we no longer need to generate our own signal.
> This lets the original task survive the error if memory_failure() can
> recover the corrupted memory.
>
> Based on v5.6-rc2. I'm afraid it touches three different trees.
> $subject says ACPI as that is where the bulk of the diffstat is.
>
> This series may conflict in arm64 with a series from Mark Rutland to
> clean up the daif/PMR toggling.
>
>
> This would be v9 of these patches, but after a year I figure I should
> start the numbering again. I've dropped any collected tags.
>
> Known issues:
>  * arm64's apei_claim_sea() may unwittingly re-enable debug if it takes
>    an external-abort from debug context. Patch 3 makes this worse
>    instead of fixing it. The fix would make use of helpers from Mark R's
>    series.
>
>
> Thanks,
>
> James
>
>
> [0] https://lore.kernel.org/linux-arm-kernel/20190129184902.102850-1-james.morse@arm.com/
> [1] https://lore.kernel.org/linux-acpi/1506516620-20033-3-git-send-email-xiexiuqi@huawei.com/
>
> James Morse (3):
>   mm/memory-failure: Add memory_failure_queue_kick()
>   ACPI / APEI: Kick the memory_failure() queue for synchronous errors
>   arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work
>
>  arch/arm64/kernel/acpi.c | 25 +++++++++++++++
>  arch/arm64/mm/fault.c    | 12 ++++---
>  drivers/acpi/apei/ghes.c | 68 +++++++++++++++++++++++++++++++++-------
>  include/acpi/ghes.h      |  3 ++
>  include/linux/mm.h       |  1 +
>  mm/memory-failure.c      | 15 ++++++++-
>  6 files changed, 107 insertions(+), 17 deletions(-)
>
> --
> 2.24.1
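
For readers following the thread, the race described in the cover letter
can be modelled in plain user-space C. This is only a minimal sketch under
stated assumptions, not the kernel implementation: struct mf_model,
model_queue(), model_kick(), model_user_load() and model_run() are all
hypothetical names made up for illustration. They stand in loosely for
memory_failure_queue(), the proposed memory_failure_queue_kick(), and a
user-space load from the corrupted page. The model only counts how often
the same error is reported when the deferred work is, or is not, flushed
before "returning to user-space".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 16

/* Hypothetical model state; not the kernel's data structures. */
struct mf_model {
	unsigned long pending[QUEUE_DEPTH]; /* pfns awaiting deferred handling */
	size_t head, tail;
	bool poisoned_page_mapped;          /* true until the handler unmaps it */
	unsigned int errors_reported;       /* what an error counter would see */
};

/* Models the IRQ-context side: record the pfn, never sleep. */
static bool model_queue(struct mf_model *m, unsigned long pfn)
{
	if (m->tail - m->head == QUEUE_DEPTH)
		return false; /* queue full, entry dropped */
	m->pending[m->tail++ % QUEUE_DEPTH] = pfn;
	return true;
}

/* Models the deferred, sleepable work: drain the queue and unmap the page. */
static void model_kick(struct mf_model *m)
{
	while (m->head != m->tail) {
		(void)m->pending[m->head++ % QUEUE_DEPTH];
		m->poisoned_page_mapped = false;
	}
}

/* Models one user-space load from the corrupted page. */
static void model_user_load(struct mf_model *m, unsigned long pfn,
			    bool kick_before_return)
{
	if (!m->poisoned_page_mapped)
		return; /* page already unmapped: no new hardware error */

	m->errors_reported++;
	model_queue(m, pfn);

	/* With the kick, the deferred work finishes before user-space resumes. */
	if (kick_before_return)
		model_kick(m);
}

/*
 * Returns how many times the error is reported when the task retries the
 * load `attempts` times before the deferred work would otherwise have run.
 */
static unsigned int model_run(bool kick_before_return, int attempts)
{
	struct mf_model m = { .poisoned_page_mapped = true };

	assert(attempts >= 0 && attempts <= QUEUE_DEPTH);
	for (int i = 0; i < attempts; i++)
		model_user_load(&m, 0x1234, kick_before_return);

	model_kick(&m); /* the deferred work eventually runs either way */
	return m.errors_reported;
}
```

In this model, model_run(false, 3) reports the same error three times,
while model_run(true, 3) reports it once, which is the exaggerated-counter
effect the cover letter describes.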

Thread overview: 7+ messages
2020-02-28 17:48 [PATCH 0/3] ACPI / APEI: Kick the memory_failure() queue for synchronous errors James Morse
2020-02-28 17:48 ` [PATCH 1/3] mm/memory-failure: Add memory_failure_queue_kick() James Morse
2020-02-28 17:48 ` [PATCH 2/3] ACPI / APEI: Kick the memory_failure() queue for synchronous errors James Morse
2020-03-09 17:07   ` Tyler Baicar OS
2020-02-28 17:48 ` [PATCH 3/3] arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work James Morse
2020-03-24 16:41   ` Catalin Marinas
2020-03-20 19:19 ` Tyler Baicar OS [this message]
