linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot
@ 2017-08-14 17:19 Chen Yu
  2017-08-15 12:41 ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Chen Yu @ 2017-08-14 17:19 UTC (permalink / raw)
  To: linux-mm, linux-pm
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Mel Gorman,
	linux-kernel, Chen Yu, Rafael J. Wysocki, Len Brown,
	Dan Williams

There is a problem that when counting the pages for creating
the hibernation snapshot will take significant amount of
time, especially on system with large memory. Since the counting
job is performed with irq disabled, this might lead to NMI lockup.
The following warning were found on a system with 1.5TB DRAM:

[ 1124.758184] Freezing user space processes ... (elapsed 0.002 seconds) done.
[ 1124.768721] OOM killer disabled.
[ 1124.847009] PM: Preallocating image memory...
[ 1139.392042] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
[ 1139.392076] CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
[ 1139.392077] task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
[ 1139.392083] RIP: 0010:memory_bm_find_bit+0xf4/0x100
[ 1139.392084] RSP: 0018:ffffb1a3f325fc20 EFLAGS: 00000006
[ 1139.392084] RAX: 0000000000000000 RBX: 0000000013b83000 RCX: ffff9fbe89caf000
[ 1139.392085] RDX: ffffb1a3f325fc30 RSI: 0000000000003200 RDI: ffff9fbeaffffe80
[ 1139.392085] RBP: ffffb1a3f325fc40 R08: 0000000013b80000 R09: ffff9fbe89c54878
[ 1139.392085] R10: ffffb1a3f325fc2c R11: 0000000013b83200 R12: 0000000000000400
[ 1139.392086] R13: fffffd552e0c0000 R14: ffff9fc1bffd31e0 R15: 0000000000000202
[ 1139.392086] FS:  00007f3189704180(0000) GS:ffff9fbec8ec0000(0000) knlGS:0000000000000000
[ 1139.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1139.392087] CR2: 00000085da0f7398 CR3: 000001771cf9a000 CR4: 00000000007406e0
[ 1139.392088] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1139.392088] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1139.392088] PKRU: 55555554
[ 1139.392089] Call Trace:
[ 1139.392092]  ? memory_bm_set_bit+0x29/0x60
[ 1139.392094]  swsusp_set_page_free+0x2b/0x30
[ 1139.392098]  mark_free_pages+0x147/0x1c0
[ 1139.392099]  count_data_pages+0x41/0xa0
[ 1139.392101]  hibernate_preallocate_memory+0x80/0x450
[ 1139.392102]  hibernation_snapshot+0x58/0x410
[ 1139.392103]  hibernate+0x17c/0x310
[ 1139.392104]  state_store+0xdf/0xf0
[ 1139.392107]  kobj_attr_store+0xf/0x20
[ 1139.392111]  sysfs_kf_write+0x37/0x40
[ 1139.392113]  kernfs_fop_write+0x11c/0x1a0
[ 1139.392117]  __vfs_write+0x37/0x170
[ 1139.392121]  ? handle_mm_fault+0xd8/0x230
[ 1139.392122]  vfs_write+0xb1/0x1a0
[ 1139.392123]  SyS_write+0x55/0xc0
[ 1139.392126]  entry_SYSCALL_64_fastpath+0x1a/0xa5

So avoid the NMI lockup by feeding the dog in the deepest level of loop.

Reported-by: Jan Filipcewicz <jan.filipcewicz@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..4f7ac30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
+#include <linux/nmi.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2561,8 +2562,10 @@ void mark_free_pages(struct zone *zone)
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
-			for (i = 0; i < (1UL << order); i++)
+			for (i = 0; i < (1UL << order); i++) {
 				swsusp_set_page_free(pfn_to_page(pfn + i));
+				touch_nmi_watchdog();
+			}
 		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot
  2017-08-14 17:19 [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot Chen Yu
@ 2017-08-15 12:41 ` Michal Hocko
  2017-08-15 16:01   ` Chen Yu
  2017-08-16  4:53   ` Chen Yu
  0 siblings, 2 replies; 5+ messages in thread
From: Michal Hocko @ 2017-08-15 12:41 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-mm, linux-pm, Andrew Morton, Vlastimil Babka, Mel Gorman,
	linux-kernel, Rafael J. Wysocki, Len Brown, Dan Williams

On Tue 15-08-17 01:19:16, Chen Yu wrote:
[...]
> @@ -2561,8 +2562,10 @@ void mark_free_pages(struct zone *zone)
>  			unsigned long i;
>  
>  			pfn = page_to_pfn(page);
> -			for (i = 0; i < (1UL << order); i++)
> +			for (i = 0; i < (1UL << order); i++) {
>  				swsusp_set_page_free(pfn_to_page(pfn + i));
> +				touch_nmi_watchdog();
> +			}

this is rather excessive. Why don't you simply call touch_nmi_watchdog
once per every 1000 pages? Or once per free_list entry?

Moreover why don't you need to touch_nmi_watchdog in the loop over all
pfns in the zone (right above this loop)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot
  2017-08-15 12:41 ` Michal Hocko
@ 2017-08-15 16:01   ` Chen Yu
  2017-08-16 11:17     ` Michal Hocko
  2017-08-16  4:53   ` Chen Yu
  1 sibling, 1 reply; 5+ messages in thread
From: Chen Yu @ 2017-08-15 16:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-pm, Andrew Morton, Vlastimil Babka, Mel Gorman,
	linux-kernel, Rafael J. Wysocki, Len Brown, Dan Williams

Hi Michal,
On Tue, Aug 15, 2017 at 02:41:19PM +0200, Michal Hocko wrote:
> On Tue 15-08-17 01:19:16, Chen Yu wrote:
> [...]
> > @@ -2561,8 +2562,10 @@ void mark_free_pages(struct zone *zone)
> >  			unsigned long i;
> >  
> >  			pfn = page_to_pfn(page);
> > -			for (i = 0; i < (1UL << order); i++)
> > +			for (i = 0; i < (1UL << order); i++) {
> >  				swsusp_set_page_free(pfn_to_page(pfn + i));
> > +				touch_nmi_watchdog();
> > +			}
> 
> this is rather excessive. Why don't you simply call touch_nmi_watchdog
> once per every 1000 pages? Or once per free_list entry?
> 
Yes, this would be less costy, previously I thought that, since touch_nmi_watchdog()
would only update very limited amount of percpu variables and it is not costy when
comparing with the huge loop in radix tree searching by the swsusp_set_page_free(),
thus I put it in the deepest level in this loop, as this might be a safer place to avoid NMI.
> Moreover why don't you need to touch_nmi_watchdog in the loop over all
> pfns in the zone (right above this loop)?
As the NMI was triggered when checking the free_list rather than in the loop over
all pfns, it seems that the former has more possibility to catch a NMI if the
latter has already taken too much time. But yes, a safer way is to feed dog
in the latter too. I'll modify the code according to your suggestion.

Thanks,
	Yu
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot
  2017-08-15 12:41 ` Michal Hocko
  2017-08-15 16:01   ` Chen Yu
@ 2017-08-16  4:53   ` Chen Yu
  1 sibling, 0 replies; 5+ messages in thread
From: Chen Yu @ 2017-08-16  4:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-pm, Andrew Morton, Vlastimil Babka, Mel Gorman,
	linux-kernel, Rafael J. Wysocki, Len Brown, Dan Williams

On Tue, Aug 15, 2017 at 02:41:19PM +0200, Michal Hocko wrote:
> On Tue 15-08-17 01:19:16, Chen Yu wrote:
> [...]
> > @@ -2561,8 +2562,10 @@ void mark_free_pages(struct zone *zone)
> >  			unsigned long i;
> >  
> >  			pfn = page_to_pfn(page);
> > -			for (i = 0; i < (1UL << order); i++)
> > +			for (i = 0; i < (1UL << order); i++) {
> >  				swsusp_set_page_free(pfn_to_page(pfn + i));
> > +				touch_nmi_watchdog();
> > +			}
> 
> this is rather excessive. Why don't you simply call touch_nmi_watchdog
> once per every 1000 pages? Or once per free_list entry?
> 
> Moreover why don't you need to touch_nmi_watchdog in the loop over all
> pfns in the zone (right above this loop)?
> --
After re-checking the code, I think we can simply disable the watchdog
temporarily, thus to avoid feeding the watchdog in the loop.
I'm sending another version based on this.
Thanks,
	Yu
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot
  2017-08-15 16:01   ` Chen Yu
@ 2017-08-16 11:17     ` Michal Hocko
  0 siblings, 0 replies; 5+ messages in thread
From: Michal Hocko @ 2017-08-16 11:17 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-mm, linux-pm, Andrew Morton, Vlastimil Babka, Mel Gorman,
	linux-kernel, Rafael J. Wysocki, Len Brown, Dan Williams

On Wed 16-08-17 00:01:08, Chen Yu wrote:
> Hi Michal,
> On Tue, Aug 15, 2017 at 02:41:19PM +0200, Michal Hocko wrote:
[...]
> > Moreover why don't you need to touch_nmi_watchdog in the loop over all
> > pfns in the zone (right above this loop)?
> As the NMI was triggered when checking the free_list rather than in the loop over
> all pfns, it seems that the former has more possibility to catch a NMI if the
> latter has already taken too much time. But yes, a safer way is to feed dog
> in the latter too. I'll modify the code according to your suggestion.

This is a wrong approach IMHO. The whole thing has IRQ disabled.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-08-16 11:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-14 17:19 [PATCH][RFC] PM / Hibernate: Feed NMI wathdog when creating snapshot Chen Yu
2017-08-15 12:41 ` Michal Hocko
2017-08-15 16:01   ` Chen Yu
2017-08-16 11:17     ` Michal Hocko
2017-08-16  4:53   ` Chen Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).