All of lore.kernel.org
 help / color / mirror / Atom feed
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Tony Luck <tony.luck@intel.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Matthew Wilcox <willy@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Nicholas Piggin <npiggin@gmail.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults
Date: Fri, 21 Oct 2022 14:57:26 +0800	[thread overview]
Message-ID: <fc9fa0b2-16d3-9aba-02a2-61d492bde95f@linux.alibaba.com> (raw)
In-Reply-To: <Y1IbOAvpGzA8bst1@agluck-desk3.sc.intel.com>



在 2022/10/21 PM12:08, Tony Luck 写道:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> 在 2022/10/21 AM4:05, Tony Luck 写道:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> 在 2022/10/20 AM1:08, Tony Luck 写道:
> 
>>> I'm experimenting with using sched_work() to handle the call to
>>> memory_failure() (echoing what the machine check handler does using
>>> task_work)_add() to avoid the same problem of not being able to directly
>>> call memory_failure()).
>>
>> Work queues permit work to be deferred outside of the interrupt context
>> into the kernel process context. If we return to user-space before the
>> queued memory_failure() work is processed, we will take the fault again,
>> as we discussed recently.
>>
>>     commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
>>     commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>>
>> So, in my opinion, we should add memory failure as a task work, like
>> do_machine_check does, e.g.
>>
>>     queue_task_work(&m, msg, kill_me_maybe);
> 
> Maybe ... but this case isn't pending back to a user instruction
> that is trying to READ the poison memory address. The task is just
> trying to WRITE to any address within the page.

Aha, I see the difference. Thank you. But I still have a question on
this. Let us discuss in your reply to David Laight.

Best Regards,
Shuai

> 
> So this is much more like a patrol scrub error found asynchronously
> by the memory controller (in this case found asynchronously by the
> Linux page copy function).  So I don't feel that it's really the
> responsibility of the current task.
> 
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.
> 
>>> +	INIT_WORK(&p->work, do_sched_memory_failure);
>>> +	p->pfn = pfn;
>>> +	schedule_work(&p->work);
>>> +}
>>
>> I think there is already a function to do such work in mm/memory-failure.c.
>>
>> 	void memory_failure_queue(unsigned long pfn, int flags)
> 
> Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does
> exacly what I want, and is working well in tests so far. So perhaps
> a cleaner solution than making the kill_me_maybe() function globally
> visible.
> 
> -Tony

WARNING: multiple messages have this Message-ID (diff)
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Tony Luck <tony.luck@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	linux-kernel@vger.kernel.org,
	Matthew Wilcox <willy@infradead.org>,
	linux-mm@kvack.org, Nicholas Piggin <npiggin@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org,
	Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults
Date: Fri, 21 Oct 2022 14:57:26 +0800	[thread overview]
Message-ID: <fc9fa0b2-16d3-9aba-02a2-61d492bde95f@linux.alibaba.com> (raw)
In-Reply-To: <Y1IbOAvpGzA8bst1@agluck-desk3.sc.intel.com>



在 2022/10/21 PM12:08, Tony Luck 写道:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> 在 2022/10/21 AM4:05, Tony Luck 写道:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> 在 2022/10/20 AM1:08, Tony Luck 写道:
> 
>>> I'm experimenting with using sched_work() to handle the call to
>>> memory_failure() (echoing what the machine check handler does using
>>> task_work)_add() to avoid the same problem of not being able to directly
>>> call memory_failure()).
>>
>> Work queues permit work to be deferred outside of the interrupt context
>> into the kernel process context. If we return to user-space before the
>> queued memory_failure() work is processed, we will take the fault again,
>> as we discussed recently.
>>
>>     commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
>>     commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>>
>> So, in my opinion, we should add memory failure as a task work, like
>> do_machine_check does, e.g.
>>
>>     queue_task_work(&m, msg, kill_me_maybe);
> 
> Maybe ... but this case isn't pending back to a user instruction
> that is trying to READ the poison memory address. The task is just
> trying to WRITE to any address within the page.

Aha, I see the difference. Thank you. But I still have a question on
this. Let us discuss in your reply to David Laight.

Best Regards,
Shuai

> 
> So this is much more like a patrol scrub error found asynchronously
> by the memory controller (in this case found asynchronously by the
> Linux page copy function).  So I don't feel that it's really the
> responsibility of the current task.
> 
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.
> 
>>> +	INIT_WORK(&p->work, do_sched_memory_failure);
>>> +	p->pfn = pfn;
>>> +	schedule_work(&p->work);
>>> +}
>>
>> I think there is already a function to do such work in mm/memory-failure.c.
>>
>> 	void memory_failure_queue(unsigned long pfn, int flags)
> 
> Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does
> exacly what I want, and is working well in tests so far. So perhaps
> a cleaner solution than making the kill_me_maybe() function globally
> visible.
> 
> -Tony

  parent reply	other threads:[~2022-10-21  6:57 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-17 23:42 [RFC PATCH] mm, hwpoison: Recover from copy-on-write machine checks Tony Luck
2022-10-18  8:43 ` HORIGUCHI NAOYA(堀口 直也)
2022-10-18 17:52   ` Luck, Tony
2022-10-19 17:08     ` [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-19 17:08       ` Tony Luck
2022-10-19 17:45       ` Dan Williams
2022-10-19 17:45         ` Dan Williams
2022-10-19 20:30         ` Luck, Tony
2022-10-19 20:30           ` Luck, Tony
2022-10-20  1:57       ` Shuai Xue
2022-10-20  1:57         ` Shuai Xue
2022-10-20 20:05         ` Tony Luck
2022-10-20 20:05           ` Tony Luck
2022-10-21  1:38           ` Miaohe Lin
2022-10-21  1:38             ` Miaohe Lin
2022-10-21  3:57             ` Luck, Tony
2022-10-21  3:57               ` Luck, Tony
2022-10-21  1:52           ` Shuai Xue
2022-10-21  1:52             ` Shuai Xue
2022-10-21  4:08             ` Tony Luck
2022-10-21  4:08               ` Tony Luck
2022-10-21  4:11               ` David Laight
2022-10-21  4:11                 ` David Laight
2022-10-21  4:41                 ` Luck, Tony
2022-10-21  4:41                   ` Luck, Tony
2022-10-21  9:29                   ` Shuai Xue
2022-10-21  9:29                     ` Shuai Xue
2022-10-21 16:30                     ` Luck, Tony
2022-10-21 16:30                       ` Luck, Tony
2022-10-23 15:04                       ` Shuai Xue
2022-10-23 15:04                         ` Shuai Xue
2022-10-21  6:57               ` Shuai Xue [this message]
2022-10-21  6:57                 ` Shuai Xue
2022-10-21 20:01       ` [PATCH v3 0/2] Copy-on-write poison recovery Tony Luck
2022-10-21 20:01         ` Tony Luck
2022-10-21 20:01         ` [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-21 20:01           ` Tony Luck
2022-10-25  5:46           ` HORIGUCHI NAOYA(堀口 直也)
2022-10-25  5:46             ` HORIGUCHI NAOYA(堀口 直也)
2022-10-28  2:11           ` Miaohe Lin
2022-10-28  2:11             ` Miaohe Lin
2022-10-28 16:09             ` Luck, Tony
2022-10-28 16:09               ` Luck, Tony
2022-11-02 14:27               ` Alexander Potapenko
2022-11-02 14:27                 ` Alexander Potapenko
2022-11-02 14:30                 ` Alexander Potapenko
2022-11-02 14:30                   ` Alexander Potapenko
2022-10-21 20:01         ` [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline Tony Luck
2022-10-21 20:01           ` Tony Luck
2022-10-28  2:28           ` Miaohe Lin
2022-10-28  2:28             ` Miaohe Lin
2022-10-28 16:13             ` Luck, Tony
2022-10-28 16:13               ` Luck, Tony
2022-10-29  1:55               ` Miaohe Lin
2022-10-29  1:55                 ` Miaohe Lin
2022-10-23 15:52         ` [PATCH v3 0/2] Copy-on-write poison recovery Shuai Xue
2022-10-23 15:52           ` Shuai Xue
2022-10-26  5:19           ` Shuai Xue
2022-10-26  5:19             ` Shuai Xue
2022-10-31 20:10         ` [PATCH v4 " Tony Luck
2022-10-31 20:10           ` Tony Luck
2022-10-31 20:10           ` [PATCH v4 1/2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-31 20:10             ` Tony Luck
2022-10-31 20:10           ` [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline Tony Luck
2022-10-31 20:10             ` Tony Luck
2023-05-18 21:49             ` Jane Chu
2023-05-18 22:10               ` Luck, Tony
2023-05-19  7:28               ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fc9fa0b2-16d3-9aba-02a2-61d492bde95f@linux.alibaba.com \
    --to=xueshuai@linux.alibaba.com \
    --cc=akpm@linux-foundation.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=dan.j.williams@intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=npiggin@gmail.com \
    --cc=tony.luck@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.