From: zhenwei pi <pizhenwei@bytedance.com>
To: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"mst@redhat.com" <mst@redhat.com>,
	"david@redhat.com" <david@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"jasowang@redhat.com" <jasowang@redhat.com>,
	"virtualization@lists.linux-foundation.org" 
	<virtualization@lists.linux-foundation.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"peterx@redhat.com" <peterx@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: Re: [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison
Date: Mon, 30 May 2022 13:46:57 +0800
Message-ID: <286dbd1f-1c62-a171-7453-d772bd98332c@bytedance.com>
In-Reply-To: <20220530050234.GA1036127@hori.linux.bs1.fc.nec.co.jp>



On 5/30/22 13:02, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, May 20, 2022 at 03:06:47PM +0800, zhenwei pi wrote:
>> Originally, unpoison_memory() was only used by hwpoison-inject, and it
>> unpoisons a page which was poisoned by hwpoison-inject too. The kernel PTE
>> entry is not changed during software poison/unpoison.
>>
>> On a virtualization platform, the hypervisor can fix a hardware-corrupted
>> page, typically by remapping the faulty HVA (host virtual address). So add
>> a new parameter 'const char *reason' to show why unpoison_memory() was
>> called.
>>
>> Once the corrupted page gets fixed, the guest kernel needs to put the page
>> back to buddy. Reusing the page hits the following issue (Intel Platinum 8260):
>>   BUG: unable to handle page fault for address: ffff888061646000
>>   #PF: supervisor write access in kernel mode
>>   #PF: error_code(0x0002) - not-present page
>>   PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800fffff9e9b9062
>>   Oops: 0002 [#1] PREEMPT SMP NOPTI
>>   CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G   M       OE     5.18.0-rc6.bm.1-amd64 #6
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>>   RIP: 0010:clear_page_erms+0x7/0x10
>>
>> The kernel PTE entry of the fixed page is still not corrected, so the kernel
>> hits a page fault during prep_new_page. So add 'bool reset_kpte' to get a
>> chance to fix the PTE entry if the page is fixed by the hypervisor.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   include/linux/mm.h   |  2 +-
>>   mm/hwpoison-inject.c |  2 +-
>>   mm/memory-failure.c  | 26 +++++++++++++++++++-------
>>   3 files changed, 21 insertions(+), 9 deletions(-)
>>
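
The oops quoted above can be cross-checked by hand. A minimal userspace
sketch, not kernel code: it decodes the PTE value and the #PF error code
from the log. The helper names are hypothetical. It assumes the default
4-level direct-map base 0xffff888000000000 (no KASLR offset) and assumes
that non-present, non-zero x86-64 PTEs store their address bits inverted
(the L1TF mitigation); neither assumption is stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT     (1ULL << 0)            /* x86-64 "present" bit */
#define PTE_PFN_MASK    0x000ffffffffff000ULL  /* physical address bits 12..51 */
/* Assumption: default 4-level direct-map base, KASLR not shifting it. */
#define DIRECT_MAP_BASE 0xffff888000000000ULL

static bool pte_is_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

/*
 * Physical address encoded in a PTE.
 * Assumption: a non-present, non-zero PTE keeps its address bits inverted.
 */
static uint64_t pte_phys_addr(uint64_t pte)
{
    uint64_t addr = pte & PTE_PFN_MASK;

    if (pte != 0 && !pte_is_present(pte))
        addr ^= PTE_PFN_MASK;
    return addr;
}

/* #PF error code: bit 0 clear means "not-present page". */
static bool pf_not_present(uint32_t error_code)
{
    return (error_code & 0x1) == 0;
}

/* #PF error code: bit 1 set means the access was a write. */
static bool pf_is_write(uint32_t error_code)
{
    return (error_code & 0x2) != 0;
}
```

Under those assumptions, PTE 800fffff9e9b9062 has the present bit clear
and decodes to physical address 0x61646000, which is the direct-map
address ffff888061646000 reported in the BUG line.
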
> 
> Do you need to undo the rate limiting here?  In the original usage of
> unpoison, avoiding a flood of "Unpoison: Software-unpoisoned page" messages
> is helpful.
> 
> And unpoison seems to be called from virtio-balloon multiple times when
> the backend is 2MB hugepages.  If that's right, printing out 512 lines of
> "Unpoison: Unpoisoned page 0xXXX by virtio-balloon" messages might not be
> so helpful?
> 

All the suggestions (including those on '[PATCH 1/3] memory-failure:
Introduce memory failure notifier') are reasonable. I'll address them in
the next version. Thanks a lot!
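
For illustration only: the review comment quoted above notes that one 2MB
hugepage can trigger 512 per-page unpoison messages. A minimal userspace
sketch of interval/burst rate limiting, loosely modelled on the idea
behind the kernel's __ratelimit(). The struct, the function names, and
the tick and burst values are hypothetical and are not taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* 4KB base pages per 2MB hugepage: 512 */
#define SUBPAGES_PER_2M ((2UL << 20) / 4096UL)

/* Hypothetical stand-in for a rate-limit state. */
struct ratelimit {
    long interval;  /* window length, in arbitrary ticks */
    int burst;      /* messages allowed per window */
    long begin;     /* tick at which the current window started */
    int printed;    /* messages let through in the current window */
    int missed;     /* messages suppressed in the current window */
};

/* Return true if a message emitted at tick `now` may be printed. */
static bool ratelimit_ok(struct ratelimit *rs, long now)
{
    assert(rs->interval > 0 && rs->burst > 0);

    /* Start a new window once the old one has expired. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}

/* Count how many of `calls` back-to-back messages at tick `now` get through. */
static int count_printed(struct ratelimit *rs, int calls, long now)
{
    int n = 0;

    for (int i = 0; i < calls; i++)
        if (ratelimit_ok(rs, now))
            n++;
    return n;
}
```

With an interval of 5 ticks and a burst of 10, unpoisoning all 512
subpages within one window lets 10 messages through and suppresses the
other 502. A caller that wants a single line per hugepage would need to
report per range instead of relying on the limiter.
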


-- 
zhenwei pi
