Re: [PATCH] mm: Fix false softlockup during pfn range removal

From: Ben Widawsky <ben.widawsky@intel.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Scargall, Steve" <steve.scargall@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>
Subject: Re: [PATCH] mm: Fix false softlockup during pfn range removal
Date: Mon, 22 Jun 2020 09:25:27 -0700	[thread overview]
Message-ID: <20200622162527.bo765xhid563u6vp@intel.com> (raw)
In-Reply-To: <f6677b9d-98c4-b959-1c03-6bdf4d81b71c@redhat.com>

On 20-06-22 09:16:20, David Hildenbrand wrote:
> On 20.06.20 01:12, Ben Widawsky wrote:
> > When working with very large nodes, poisoning the struct pages (for
> > which there will be very many) can take a very long time. If the system
> > is using voluntary preemptions, the software watchdog will not be able
> > to detect forward progress. This patch addresses this issue by offering
> > to give up time like __remove_pages() does.  This behavior was
> > introduced in v5.6 with:
> > commit d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
> 
> We try to stay <= 72 chars in the commit message (except for kernel splats).
> 

Thanks. checkpatch does complain fwiw, I just somehow missed i.

> > 
> > Alternately, init_page_poison could do this cond_resched(), but it seems
> > to me that the caller of init_page_poison() is what actually knows
> > whether or not it should relax its own priority.
> > 
> > Based on Dan's notes, I think this is perfectly safe:
> > commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
> 
> We shouldn't be holding any spin locks, so it's safe. (we could think
> about doing this outside of the memory hotplug lock in the case of
> devmem, however, it's only a debugging feature so we most probably don't
> care).
> 
> > 
> > Aside from fixing the lockup, it is also a friendlier thing to do on
> > lower core systems that might wipe out large chunks of hotplug memory
> > (probably not a very common case).
> 
> It really only is an issue for devmem. Ordinary hotplugged system memory
> is not affected (onlined/offlined in memory block granularity).

Could you explain this a bit? I was fixing the issue found on PMEM systems, but
it seems like regularly memory hotplug was potentially a victim. I think one of
the reasons PMEM might be more likely is the time it takes to work with any data
structures store in the PMEM itself is slower (just a guess).

> 
> > 
> > Fixes this kind of splat:
> > [  352.142079] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
> > [  352.150067] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_umad vfat fat kmem intel_rapl_msr intel_rapl_common rpcrdma sunrpc rdma_ucm ib_iser isst_if_common rdma_cm skx_edac iw_cm ib_cm x86_pkg_temp_thermal libiscsi intel_powerclamp scsi_transport_iscsi coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i40iw intel_cstate iTCO_wdt ib_uverbs iTCO_vendor_support device_dax ib_core joydev intel_uncore i2c_i801 lpc_ich ipmi_ssif ioatdma dca wmi ipmi_si ipmi_devintf ipmi_msghandler dax_pmem
> > [  352.150123]  dax_pmem_core acpi_pad acpi_power_meter drm ip_tables xfs libcrc32c nd_pmem nd_btt i40e e1000e crc32c_intel nfit
> > [  352.260774] irq event stamp: 138450
> > [  352.264692] hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
> > [  352.275134] hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
> > [  352.285662] softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
> > [  352.295233] softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
> > [  352.304210] CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
> > [  352.313283] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
> > [  352.324308] RIP: 0010:memset_erms+0x9/0x10
> > [  352.328905] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> > [  352.349953] RSP: 0018:ffffc90021b5fd50 EFLAGS: 00010202 ORIG_RAX: ffffffffffffff13
> > [  352.358450] RAX: 00000000000000ff RBX: ffff88983ffd5000 RCX: 0000000065df0e40
> > [  352.366457] RDX: 00000003a0000000 RSI: 00000000ffffffff RDI: ffffea039b20f1c0
> > [  352.374465] RBP: ffff88983ffd6c00 R08: 0000000000000000 R09: ffffea0061000000
> > [  352.382473] R10: 0000000000000001 R11: 0000000000000000 R12: ffffea006f808000
> > [  352.390480] R13: 0000000001840000 R14: 000000000e800000 R15: ffff8997d7b764e0
> > [  352.398487] FS:  00007f724ef81780(0000) GS:ffff8997df100000(0000) knlGS:0000000000000000
> > [  352.407562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  352.414011] CR2: 00007ffcd4070768 CR3: 000001178c722002 CR4: 00000000003606e0
> > [  352.422056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  352.430092] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  352.438115] Call Trace:
> > [  352.440865]  remove_pfn_range_from_zone+0x3a/0x380
> > [  352.446244]  ? cpumask_next+0x17/0x20
> > [  352.450361]  memunmap_pages+0x17f/0x280
> > [  352.454670]  release_nodes+0x22a/0x260
> > [  352.458888]  __device_release_driver+0x172/0x220
> > [  352.464070]  device_driver_detach+0x3e/0xa0
> > [  352.468753]  unbind_store+0x113/0x130
> > [  352.472868]  kernfs_fop_write+0xdc/0x1c0
> > [  352.477273]  vfs_write+0xde/0x1d0
> > [  352.482218]  ksys_write+0x58/0xd0
> > [  352.487207]  do_syscall_64+0x5a/0x120
> > [  352.492529]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [  352.499446] RIP: 0033:0x7f724f40b5e7
> > [  352.504673] Code: Bad RIP value.
> > [  352.509484] RSP: 002b:00007ffcd40738f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [  352.519213] RAX: ffffffffffffffda RBX: 00007f724ef816a8 RCX: 00007f724f40b5e7
> > [  352.528410] RDX: 0000000000000007 RSI: 00005617d7cd1277 RDI: 0000000000000003
> > [  352.537573] RBP: 0000000000000003 R08: 00000000ffffffff R09: 00007ffcd40737d0
> > [  352.546764] R10: 0000000000000000 R11: 0000000000000246 R12: 00005617d7cd1277
> > [  352.555929] R13: 0000000000000000 R14: 0000000000000007 R15: 00005617d7cd1230
> > [  370.353742] Built 2 zonelists, mobility grouping on.  Total pages: 49050381
> > [  370.373317] Policy zone: Normal
> > [  374.948164] Built 3 zonelists, mobility grouping on.  Total pages: 49312525
> > [  375.017496] Policy zone: Normal
> 
> I'd have stripped this to a reasonable minimum.
> 
> > 
> > Fixes: commit d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
> > Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
> > Reported-by: Ben Widawsky <ben.widawsky@intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Vishal Verma <vishal.l.verma@intel.com>
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > ---
> >  mm/memory_hotplug.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 9b34e03e730a..da374cd3d45b 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -471,11 +471,20 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
> >  				      unsigned long start_pfn,
> >  				      unsigned long nr_pages)
> >  {
> > +	const unsigned long end_pfn = start_pfn + nr_pages;
> >  	struct pglist_data *pgdat = zone->zone_pgdat;
> > -	unsigned long flags;
> > +	unsigned long pfn, cur_nr_pages, flags;
> >  
> >  	/* Poison struct pages because they are now uninitialized again. */
> > -	page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
> > +	for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
> > +		cond_resched();
> > +
> > +		/* Select all remaining pages up to the next section boundary */
> > +		cur_nr_pages =
> > +			min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);
> 
> Nit: I'd have used the same indentation as the code you copied this from.
> 
> > +		page_init_poison(pfn_to_page(pfn),
> > +				 sizeof(struct page) * cur_nr_pages);
> > +	}
> >  
> >  #ifdef CONFIG_ZONE_DEVICE
> >  	/*
> > 
> 
> Thanks!
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> -- 
> Thanks,
> 
> David / dhildenb