All of lore.kernel.org
 help / color / mirror / Atom feed
From: Miaohe Lin <linmiaohe@huawei.com>
To: Tony Luck <tony.luck@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Shuai Xue <xueshuai@linux.alibaba.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Nicholas Piggin <npiggin@gmail.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<linuxppc-dev@lists.ozlabs.org>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults
Date: Fri, 28 Oct 2022 10:11:51 +0800	[thread overview]
Message-ID: <a60484bf-2107-8bc4-acdc-5f582f9637af@huawei.com> (raw)
In-Reply-To: <20221021200120.175753-2-tony.luck@intel.com>

On 2022/10/22 4:01, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Thanks for your work, Tony.

> 
> ---
> Changes in V3:
>     Dan Williams
> 	Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
> 	consistency with Linus' discussion on names of functions that
> 	check for machine check.
> 	Write complete functions for the have/have-not copy_mc_to_kernel
> 	cases (so grep shows there are two versions)
> 	Change __wp_page_copy_user() to return 0 for success, negative for fail
> 	[I picked -EAGAIN for both non-EHWPOISON cases]
> 
> Changes in V2:
>    Naoya Horiguchi:
> 	1) Use -EHWPOISON error code instead of minus one.
> 	2) Poison path needs also to deal with old_page
>    Tony Luck:
> 	Rewrote commit message
> 	Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 24 ++++++++++++++++++++++++
>  mm/memory.c             | 30 ++++++++++++++++++++----------
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..a32c64681f03 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,30 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>  
>  #endif
>  
> +#ifdef copy_mc_to_kernel
> +static inline int copy_mc_user_highpage(struct page *to, struct page *from,
> +					unsigned long vaddr, struct vm_area_struct *vma)
> +{
> +	unsigned long ret;
> +	char *vfrom, *vto;
> +
> +	vfrom = kmap_local_page(from);
> +	vto = kmap_local_page(to);
> +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
__HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
with kmsan, so I can easy be wrong.

Anyway, this patch looks good to me. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Miaohe Lin



WARNING: multiple messages have this Message-ID (diff)
From: Miaohe Lin <linmiaohe@huawei.com>
To: Tony Luck <tony.luck@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Nicholas Piggin <npiggin@gmail.com>,
	Dan Williams <dan.j.williams@intel.com>,
	linuxppc-dev@lists.ozlabs.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuai Xue <xueshuai@linux.alibaba.com>
Subject: Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults
Date: Fri, 28 Oct 2022 10:11:51 +0800	[thread overview]
Message-ID: <a60484bf-2107-8bc4-acdc-5f582f9637af@huawei.com> (raw)
In-Reply-To: <20221021200120.175753-2-tony.luck@intel.com>

On 2022/10/22 4:01, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Thanks for your work, Tony.

> 
> ---
> Changes in V3:
>     Dan Williams
> 	Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
> 	consistency with Linus' discussion on names of functions that
> 	check for machine check.
> 	Write complete functions for the have/have-not copy_mc_to_kernel
> 	cases (so grep shows there are two versions)
> 	Change __wp_page_copy_user() to return 0 for success, negative for fail
> 	[I picked -EAGAIN for both non-EHWPOISON cases]
> 
> Changes in V2:
>    Naoya Horiguchi:
> 	1) Use -EHWPOISON error code instead of minus one.
> 	2) Poison path needs also to deal with old_page
>    Tony Luck:
> 	Rewrote commit message
> 	Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 24 ++++++++++++++++++++++++
>  mm/memory.c             | 30 ++++++++++++++++++++----------
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..a32c64681f03 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,30 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>  
>  #endif
>  
> +#ifdef copy_mc_to_kernel
> +static inline int copy_mc_user_highpage(struct page *to, struct page *from,
> +					unsigned long vaddr, struct vm_area_struct *vma)
> +{
> +	unsigned long ret;
> +	char *vfrom, *vto;
> +
> +	vfrom = kmap_local_page(from);
> +	vto = kmap_local_page(to);
> +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
__HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
with kmsan, so I can easy be wrong.

Anyway, this patch looks good to me. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Miaohe Lin



  parent reply	other threads:[~2022-10-28  2:12 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-17 23:42 [RFC PATCH] mm, hwpoison: Recover from copy-on-write machine checks Tony Luck
2022-10-18  8:43 ` HORIGUCHI NAOYA(堀口 直也)
2022-10-18 17:52   ` Luck, Tony
2022-10-19 17:08     ` [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-19 17:08       ` Tony Luck
2022-10-19 17:45       ` Dan Williams
2022-10-19 17:45         ` Dan Williams
2022-10-19 20:30         ` Luck, Tony
2022-10-19 20:30           ` Luck, Tony
2022-10-20  1:57       ` Shuai Xue
2022-10-20  1:57         ` Shuai Xue
2022-10-20 20:05         ` Tony Luck
2022-10-20 20:05           ` Tony Luck
2022-10-21  1:38           ` Miaohe Lin
2022-10-21  1:38             ` Miaohe Lin
2022-10-21  3:57             ` Luck, Tony
2022-10-21  3:57               ` Luck, Tony
2022-10-21  1:52           ` Shuai Xue
2022-10-21  1:52             ` Shuai Xue
2022-10-21  4:08             ` Tony Luck
2022-10-21  4:08               ` Tony Luck
2022-10-21  4:11               ` David Laight
2022-10-21  4:11                 ` David Laight
2022-10-21  4:41                 ` Luck, Tony
2022-10-21  4:41                   ` Luck, Tony
2022-10-21  9:29                   ` Shuai Xue
2022-10-21  9:29                     ` Shuai Xue
2022-10-21 16:30                     ` Luck, Tony
2022-10-21 16:30                       ` Luck, Tony
2022-10-23 15:04                       ` Shuai Xue
2022-10-23 15:04                         ` Shuai Xue
2022-10-21  6:57               ` Shuai Xue
2022-10-21  6:57                 ` Shuai Xue
2022-10-21 20:01       ` [PATCH v3 0/2] Copy-on-write poison recovery Tony Luck
2022-10-21 20:01         ` Tony Luck
2022-10-21 20:01         ` [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-21 20:01           ` Tony Luck
2022-10-25  5:46           ` HORIGUCHI NAOYA(堀口 直也)
2022-10-25  5:46             ` HORIGUCHI NAOYA(堀口 直也)
2022-10-28  2:11           ` Miaohe Lin [this message]
2022-10-28  2:11             ` Miaohe Lin
2022-10-28 16:09             ` Luck, Tony
2022-10-28 16:09               ` Luck, Tony
2022-11-02 14:27               ` Alexander Potapenko
2022-11-02 14:27                 ` Alexander Potapenko
2022-11-02 14:30                 ` Alexander Potapenko
2022-11-02 14:30                   ` Alexander Potapenko
2022-10-21 20:01         ` [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline Tony Luck
2022-10-21 20:01           ` Tony Luck
2022-10-28  2:28           ` Miaohe Lin
2022-10-28  2:28             ` Miaohe Lin
2022-10-28 16:13             ` Luck, Tony
2022-10-28 16:13               ` Luck, Tony
2022-10-29  1:55               ` Miaohe Lin
2022-10-29  1:55                 ` Miaohe Lin
2022-10-23 15:52         ` [PATCH v3 0/2] Copy-on-write poison recovery Shuai Xue
2022-10-23 15:52           ` Shuai Xue
2022-10-26  5:19           ` Shuai Xue
2022-10-26  5:19             ` Shuai Xue
2022-10-31 20:10         ` [PATCH v4 " Tony Luck
2022-10-31 20:10           ` Tony Luck
2022-10-31 20:10           ` [PATCH v4 1/2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-31 20:10             ` Tony Luck
2022-10-31 20:10           ` [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline Tony Luck
2022-10-31 20:10             ` Tony Luck
2023-05-18 21:49             ` Jane Chu
2023-05-18 22:10               ` Luck, Tony
2023-05-19  7:28               ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a60484bf-2107-8bc4-acdc-5f582f9637af@huawei.com \
    --to=linmiaohe@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=dan.j.williams@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=npiggin@gmail.com \
    --cc=tony.luck@intel.com \
    --cc=willy@infradead.org \
    --cc=xueshuai@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.