linux-efi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
@ 2024-04-11  0:49 Chen Yu
  2024-04-22 14:40 ` Chen Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Chen Yu @ 2024-04-11  0:49 UTC (permalink / raw)
  To: linux-efi
  Cc: Ard Biesheuvel, Kirill A. Shutemov, Vlastimil Babka,
	Nikolay Borisov, Chao Gao, linux-kernel, Chen Yu, Hossain,
	Md Iqbal

Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
by parallel memory acceptance") has released the spinlock so
other CPUs can do memory acceptance in parallel and not
triggers softlockup on other CPUs.

However the softlock up was intermittent shown up if the memory
of the TD guest is large, and the timeout of softlockup is set
to 1 second.

The symptom is:
When the local irq is enabled at the end of accept_memory(),
the softlockup detects that the watchdog on single CPU has
not been fed for a while. That is to say, even other CPUs
will not be blocked by spinlock, the current CPU might be
stunk with local irq disabled for a while, which hurts not
only nmi watchdog but also softlockup.

Chao Gao pointed out that the memory accept could be time
costly and there was similar report before. Thus to avoid
any softlocup detection during this stage, give the
softlockup a flag to skip the timeout check at the end of
accept_memory(), by invoking touch_softlockup_watchdog().

Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v1 -> v2:
	 Refine the commit log and add fixes tag/reviewed-by tag from Kirill.
---
 drivers/firmware/efi/unaccepted_memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 5b439d04079c..50f6503fe49f 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -4,6 +4,7 @@
 #include <linux/memblock.h>
 #include <linux/spinlock.h>
 #include <linux/crash_dump.h>
+#include <linux/nmi.h>
 #include <asm/unaccepted_memory.h>
 
 /* Protects unaccepted memory bitmap and accepting_list */
@@ -149,6 +150,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	}
 
 	list_del(&range.list);
+
+	touch_softlockup_watchdog();
+
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
  2024-04-11  0:49 [PATCH v2] efi/unaccepted: touch soft lockup during memory accept Chen Yu
@ 2024-04-22 14:40 ` Chen Yu
  2024-04-24 17:12   ` Ard Biesheuvel
  0 siblings, 1 reply; 7+ messages in thread
From: Chen Yu @ 2024-04-22 14:40 UTC (permalink / raw)
  To: linux-efi, Ard Biesheuvel
  Cc: Kirill A. Shutemov, Vlastimil Babka, Nikolay Borisov, Chao Gao,
	linux-kernel, Hossain, Md Iqbal

On 2024-04-11 at 08:49:07 +0800, Chen Yu wrote:
> Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
> by parallel memory acceptance") has released the spinlock so
> other CPUs can do memory acceptance in parallel and not
> triggers softlockup on other CPUs.
> 
> However the softlock up was intermittent shown up if the memory
> of the TD guest is large, and the timeout of softlockup is set
> to 1 second.
> 
> The symptom is:
> When the local irq is enabled at the end of accept_memory(),
> the softlockup detects that the watchdog on single CPU has
> not been fed for a while. That is to say, even other CPUs
> will not be blocked by spinlock, the current CPU might be
> stunk with local irq disabled for a while, which hurts not
> only nmi watchdog but also softlockup.
> 
> Chao Gao pointed out that the memory accept could be time
> costly and there was similar report before. Thus to avoid
> any softlocup detection during this stage, give the
> softlockup a flag to skip the timeout check at the end of
> accept_memory(), by invoking touch_softlockup_watchdog().
> 
> Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
> Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> v1 -> v2:
> 	 Refine the commit log and add fixes tag/reviewed-by tag from Kirill.

Gently pinging about this patch.

thanks,
Chenyu 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
  2024-04-22 14:40 ` Chen Yu
@ 2024-04-24 17:12   ` Ard Biesheuvel
  2024-05-03 10:31     ` Ard Biesheuvel
  0 siblings, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2024-04-24 17:12 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-efi, Kirill A. Shutemov, Vlastimil Babka, Nikolay Borisov,
	Chao Gao, linux-kernel, Hossain, Md Iqbal

On Mon, 22 Apr 2024 at 16:40, Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2024-04-11 at 08:49:07 +0800, Chen Yu wrote:
> > Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
> > by parallel memory acceptance") has released the spinlock so
> > other CPUs can do memory acceptance in parallel and not
> > triggers softlockup on other CPUs.
> >
> > However the softlock up was intermittent shown up if the memory
> > of the TD guest is large, and the timeout of softlockup is set
> > to 1 second.
> >
> > The symptom is:
> > When the local irq is enabled at the end of accept_memory(),
> > the softlockup detects that the watchdog on single CPU has
> > not been fed for a while. That is to say, even other CPUs
> > will not be blocked by spinlock, the current CPU might be
> > stunk with local irq disabled for a while, which hurts not
> > only nmi watchdog but also softlockup.
> >
> > Chao Gao pointed out that the memory accept could be time
> > costly and there was similar report before. Thus to avoid
> > any softlocup detection during this stage, give the
> > softlockup a flag to skip the timeout check at the end of
> > accept_memory(), by invoking touch_softlockup_watchdog().
> >
> > Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
> > Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> > v1 -> v2:
> >        Refine the commit log and add fixes tag/reviewed-by tag from Kirill.
>
> Gently pinging about this patch.
>

Queued up in efi/urgent now, thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
  2024-04-24 17:12   ` Ard Biesheuvel
@ 2024-05-03 10:31     ` Ard Biesheuvel
  2024-05-03 13:47       ` Kirill A. Shutemov
  0 siblings, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2024-05-03 10:31 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-efi, Kirill A. Shutemov, Vlastimil Babka, Nikolay Borisov,
	Chao Gao, linux-kernel, Hossain, Md Iqbal

On Wed, 24 Apr 2024 at 19:12, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Mon, 22 Apr 2024 at 16:40, Chen Yu <yu.c.chen@intel.com> wrote:
> >
> > On 2024-04-11 at 08:49:07 +0800, Chen Yu wrote:
> > > Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
> > > by parallel memory acceptance") has released the spinlock so
> > > other CPUs can do memory acceptance in parallel and not
> > > triggers softlockup on other CPUs.
> > >
> > > However the softlock up was intermittent shown up if the memory
> > > of the TD guest is large, and the timeout of softlockup is set
> > > to 1 second.
> > >
> > > The symptom is:
> > > When the local irq is enabled at the end of accept_memory(),
> > > the softlockup detects that the watchdog on single CPU has
> > > not been fed for a while. That is to say, even other CPUs
> > > will not be blocked by spinlock, the current CPU might be
> > > stunk with local irq disabled for a while, which hurts not
> > > only nmi watchdog but also softlockup.
> > >
> > > Chao Gao pointed out that the memory accept could be time
> > > costly and there was similar report before. Thus to avoid
> > > any softlocup detection during this stage, give the
> > > softlockup a flag to skip the timeout check at the end of
> > > accept_memory(), by invoking touch_softlockup_watchdog().
> > >
> > > Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
> > > Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
> > > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > ---
> > > v1 -> v2:
> > >        Refine the commit log and add fixes tag/reviewed-by tag from Kirill.
> >
> > Gently pinging about this patch.
> >
>
> Queued up in efi/urgent now, thanks.

OK, I was about to send this patch to Linus (and I am still going to).

However, I do wonder if sprinkling touch_softlockup_watchdog() left
and right is really the right solution here.

Looking at the backtrace, this is a page fault originating in user
space. So why do we end up calling into the hypervisor to accept a
chunk of memory large enough to trigger the softlockup watchdog? Or is
the hypercall simply taking a disproportionate amount of time?

And AIUI, touch_softlockup_watchdog() hides the fact that we are
hogging the CPU for way too long - is there any way we can at least
yield the CPU on this condition?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
  2024-05-03 10:31     ` Ard Biesheuvel
@ 2024-05-03 13:47       ` Kirill A. Shutemov
  2024-05-03 15:00         ` Chen Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Kirill A. Shutemov @ 2024-05-03 13:47 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Chen Yu, linux-efi, Kirill A. Shutemov, Vlastimil Babka,
	Nikolay Borisov, Chao Gao, linux-kernel, Hossain, Md Iqbal

On Fri, May 03, 2024 at 12:31:12PM +0200, Ard Biesheuvel wrote:
> On Wed, 24 Apr 2024 at 19:12, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Mon, 22 Apr 2024 at 16:40, Chen Yu <yu.c.chen@intel.com> wrote:
> > >
> > > On 2024-04-11 at 08:49:07 +0800, Chen Yu wrote:
> > > > Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
> > > > by parallel memory acceptance") has released the spinlock so
> > > > other CPUs can do memory acceptance in parallel and not
> > > > triggers softlockup on other CPUs.
> > > >
> > > > However the softlock up was intermittent shown up if the memory
> > > > of the TD guest is large, and the timeout of softlockup is set
> > > > to 1 second.
> > > >
> > > > The symptom is:
> > > > When the local irq is enabled at the end of accept_memory(),
> > > > the softlockup detects that the watchdog on single CPU has
> > > > not been fed for a while. That is to say, even other CPUs
> > > > will not be blocked by spinlock, the current CPU might be
> > > > stunk with local irq disabled for a while, which hurts not
> > > > only nmi watchdog but also softlockup.
> > > >
> > > > Chao Gao pointed out that the memory accept could be time
> > > > costly and there was similar report before. Thus to avoid
> > > > any softlocup detection during this stage, give the
> > > > softlockup a flag to skip the timeout check at the end of
> > > > accept_memory(), by invoking touch_softlockup_watchdog().
> > > >
> > > > Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
> > > > Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
> > > > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > > ---
> > > > v1 -> v2:
> > > >        Refine the commit log and add fixes tag/reviewed-by tag from Kirill.
> > >
> > > Gently pinging about this patch.
> > >
> >
> > Queued up in efi/urgent now, thanks.
> 
> OK, I was about to send this patch to Linus (and I am still going to).
> 
> However, I do wonder if sprinkling touch_softlockup_watchdog() left
> and right is really the right solution here.
> 
> Looking at the backtrace, this is a page fault originating in user
> space. So why do we end up calling into the hypervisor to accept a
> chunk of memory large enough to trigger the softlockup watchdog? Or is
> the hypercall simply taking a disproportionate amount of time?

Note that softlockup timeout was set to 1 second to trigger this. So this
is exaggerated case.

> And AIUI, touch_softlockup_watchdog() hides the fact that we are
> hogging the CPU for way too long - is there any way we can at least
> yield the CPU on this condition?

Not really. There's no magic entity that handles accept. It is done by
CPU.

There's a feature in pipeline that makes page accept interruptable in TDX
guest. It can help some cases. But if ended up in this codepath from
non-preemptable context, it won't help.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
  2024-05-03 13:47       ` Kirill A. Shutemov
@ 2024-05-03 15:00         ` Chen Yu
  2024-05-06  9:24           ` Kirill A. Shutemov
  0 siblings, 1 reply; 7+ messages in thread
From: Chen Yu @ 2024-05-03 15:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, linux-efi, Kirill A. Shutemov, Vlastimil Babka,
	Nikolay Borisov, Chao Gao, linux-kernel, Hossain, Md Iqbal

On 2024-05-03 at 16:47:49 +0300, Kirill A. Shutemov wrote:
> On Fri, May 03, 2024 at 12:31:12PM +0200, Ard Biesheuvel wrote:
> > On Wed, 24 Apr 2024 at 19:12, Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Mon, 22 Apr 2024 at 16:40, Chen Yu <yu.c.chen@intel.com> wrote:
> > > >
> > > > On 2024-04-11 at 08:49:07 +0800, Chen Yu wrote:
> > > > > Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
> > > > > by parallel memory acceptance") has released the spinlock so
> > > > > other CPUs can do memory acceptance in parallel and not
> > > > > triggers softlockup on other CPUs.
> > > > >
> > > > > However the softlock up was intermittent shown up if the memory
> > > > > of the TD guest is large, and the timeout of softlockup is set
> > > > > to 1 second.
> > > > >
> > > > > The symptom is:
> > > > > When the local irq is enabled at the end of accept_memory(),
> > > > > the softlockup detects that the watchdog on single CPU has
> > > > > not been fed for a while. That is to say, even other CPUs
> > > > > will not be blocked by spinlock, the current CPU might be
> > > > > stunk with local irq disabled for a while, which hurts not
> > > > > only nmi watchdog but also softlockup.
> > > > >
> > > > > Chao Gao pointed out that the memory accept could be time
> > > > > costly and there was similar report before. Thus to avoid
> > > > > any softlocup detection during this stage, give the
> > > > > softlockup a flag to skip the timeout check at the end of
> > > > > accept_memory(), by invoking touch_softlockup_watchdog().
> > > > >
> > > > > Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
> > > > > Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
> > > > > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > > > ---
> > > > > v1 -> v2:
> > > > >        Refine the commit log and add fixes tag/reviewed-by tag from Kirill.
> > > >
> > > > Gently pinging about this patch.
> > > >
> > >
> > > Queued up in efi/urgent now, thanks.
> > 
> > OK, I was about to send this patch to Linus (and I am still going to).
> > 
> > However, I do wonder if sprinkling touch_softlockup_watchdog() left
> > and right is really the right solution here.
> > 
> > Looking at the backtrace, this is a page fault originating in user
> > space. So why do we end up calling into the hypervisor to accept a
> > chunk of memory large enough to trigger the softlockup watchdog? Or is
> > the hypercall simply taking a disproportionate amount of time?
> 
> Note that softlockup timeout was set to 1 second to trigger this. So this
> is exaggerated case.
> 
> > And AIUI, touch_softlockup_watchdog() hides the fact that we are
> > hogging the CPU for way too long - is there any way we can at least
> > yield the CPU on this condition?
> 
> Not really. There's no magic entity that handles accept. It is done by
> CPU.
> 
> There's a feature in pipeline that makes page accept interruptable in TDX
> guest. It can help some cases. But if ended up in this codepath from
> non-preemptable context, it won't help.
>

Is it possible to enable the local irq for a little while after
each arch_accept_memory(phys_start, phys_end),
and even split the [phys_start,phys_end] to smaller regions?
so the watchdog can be fed on time/tick is normal. But currently
the softlock fed at the end seems to be more easier to implement.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] efi/unaccepted: touch soft lockup during memory accept
  2024-05-03 15:00         ` Chen Yu
@ 2024-05-06  9:24           ` Kirill A. Shutemov
  0 siblings, 0 replies; 7+ messages in thread
From: Kirill A. Shutemov @ 2024-05-06  9:24 UTC (permalink / raw)
  To: Chen Yu
  Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-efi, Vlastimil Babka,
	Nikolay Borisov, Chao Gao, linux-kernel, Hossain, Md Iqbal

On Fri, May 03, 2024 at 11:00:18PM +0800, Chen Yu wrote:
> On 2024-05-03 at 16:47:49 +0300, Kirill A. Shutemov wrote:
> > On Fri, May 03, 2024 at 12:31:12PM +0200, Ard Biesheuvel wrote:
> > > On Wed, 24 Apr 2024 at 19:12, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Mon, 22 Apr 2024 at 16:40, Chen Yu <yu.c.chen@intel.com> wrote:
> > > > >
> > > > > On 2024-04-11 at 08:49:07 +0800, Chen Yu wrote:
> > > > > > Commit 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused
> > > > > > by parallel memory acceptance") has released the spinlock so
> > > > > > other CPUs can do memory acceptance in parallel and not
> > > > > > triggers softlockup on other CPUs.
> > > > > >
> > > > > > However the softlock up was intermittent shown up if the memory
> > > > > > of the TD guest is large, and the timeout of softlockup is set
> > > > > > to 1 second.
> > > > > >
> > > > > > The symptom is:
> > > > > > When the local irq is enabled at the end of accept_memory(),
> > > > > > the softlockup detects that the watchdog on single CPU has
> > > > > > not been fed for a while. That is to say, even other CPUs
> > > > > > will not be blocked by spinlock, the current CPU might be
> > > > > > stunk with local irq disabled for a while, which hurts not
> > > > > > only nmi watchdog but also softlockup.
> > > > > >
> > > > > > Chao Gao pointed out that the memory accept could be time
> > > > > > costly and there was similar report before. Thus to avoid
> > > > > > any softlocup detection during this stage, give the
> > > > > > softlockup a flag to skip the timeout check at the end of
> > > > > > accept_memory(), by invoking touch_softlockup_watchdog().
> > > > > >
> > > > > > Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
> > > > > > Reported-by: "Hossain, Md Iqbal" <md.iqbal.hossain@intel.com>
> > > > > > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > > > > ---
> > > > > > v1 -> v2:
> > > > > >        Refine the commit log and add fixes tag/reviewed-by tag from Kirill.
> > > > >
> > > > > Gently pinging about this patch.
> > > > >
> > > >
> > > > Queued up in efi/urgent now, thanks.
> > > 
> > > OK, I was about to send this patch to Linus (and I am still going to).
> > > 
> > > However, I do wonder if sprinkling touch_softlockup_watchdog() left
> > > and right is really the right solution here.
> > > 
> > > Looking at the backtrace, this is a page fault originating in user
> > > space. So why do we end up calling into the hypervisor to accept a
> > > chunk of memory large enough to trigger the softlockup watchdog? Or is
> > > the hypercall simply taking a disproportionate amount of time?
> > 
> > Note that softlockup timeout was set to 1 second to trigger this. So this
> > is exaggerated case.
> > 
> > > And AIUI, touch_softlockup_watchdog() hides the fact that we are
> > > hogging the CPU for way too long - is there any way we can at least
> > > yield the CPU on this condition?
> > 
> > Not really. There's no magic entity that handles accept. It is done by
> > CPU.
> > 
> > There's a feature in pipeline that makes page accept interruptable in TDX
> > guest. It can help some cases. But if ended up in this codepath from
> > non-preemptable context, it won't help.
> >
> 
> Is it possible to enable the local irq for a little while after
> each arch_accept_memory(phys_start, phys_end),
> and even split the [phys_start,phys_end] to smaller regions?
> so the watchdog can be fed on time/tick is normal. But currently
> the softlock fed at the end seems to be more easier to implement.

That's what I did initially. But Vlastimil correctly pointed that it will
lead to deadlock.

https://lore.kernel.org/all/088593ea-e001-fa87-909f-a196b1373ca4@suse.cz/

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-05-06  9:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-11  0:49 [PATCH v2] efi/unaccepted: touch soft lockup during memory accept Chen Yu
2024-04-22 14:40 ` Chen Yu
2024-04-24 17:12   ` Ard Biesheuvel
2024-05-03 10:31     ` Ard Biesheuvel
2024-05-03 13:47       ` Kirill A. Shutemov
2024-05-03 15:00         ` Chen Yu
2024-05-06  9:24           ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).