From: David Hildenbrand <david@redhat.com>
To: Yang Shi <shy828301@gmail.com>,
naoya.horiguchi@nec.com, osalvador@suse.de, tdmackey@twitter.com,
willy@infradead.org, akpm@linux-foundation.org, corbet@lwn.net
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [v2 PATCH 1/3] mm: hwpoison: don't drop slab caches for offlining non-LRU page
Date: Fri, 20 Aug 2021 09:08:39 +0200 [thread overview]
Message-ID: <12bf652b-7042-25d7-4910-276a6e422111@redhat.com> (raw)
In-Reply-To: <20210819054116.266126-1-shy828301@gmail.com>
On 19.08.21 07:41, Yang Shi wrote:
> In the current implementation of soft offline, if non-LRU page is met,
> all the slab caches will be dropped to free the page then offline. But
> if the page is not slab page all the effort is wasted in vain. Even
> though it is a slab page, it is not guaranteed the page could be freed
> at all.
>
> However the side effect and cost is quite high. It does not only drop
> the slab caches, but also may drop a significant amount of page caches
> which are associated with inode caches. It could make the most
> workingset gone in order to just offline a page. And the offline is not
> guaranteed to succeed at all, actually I really doubt the success rate
> for real life workload.
>
> Furthermore the worse consequence is the system may be locked up and
> unusable since the page cache release may incur huge amount of works
> queued for memcg release.
>
> Actually we ran into such unpleasant case in our production environment.
> Firstly, the workqueue of memory_failure_work_func is locked up as
> below:
>
> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
> pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
> in-flight: 409271:memory_failure_work_func
> pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
> workqueue mm_percpu_wq: flags=0x8
> pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> pending: vmstat_update
> workqueue cgroup_destroy: flags=0x0
> pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
> pending: css_release_work_fn
>
> There were over 12K css_release_work_fn queued, and this caused a few
> lockups due to the contention of worker pool lock with IRQ disabled, for
> example:
>
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
> Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
> Workqueue: events memory_failure_work_func
> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
> FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
> Call Trace:
> __queue_work+0xd6/0x3c0
> queue_work_on+0x1c/0x30
> uncharge_batch+0x10e/0x110
> mem_cgroup_uncharge_list+0x6d/0x80
> release_pages+0x37f/0x3f0
> __pagevec_release+0x1c/0x50
> __invalidate_mapping_pages+0x348/0x380
> ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
> inode_lru_isolate+0x10a/0x160
> ? iput+0x1d0/0x1d0
> __list_lru_walk_one+0x7b/0x170
> ? iput+0x1d0/0x1d0
> list_lru_walk_one+0x4a/0x60
> prune_icache_sb+0x37/0x50
> super_cache_scan+0x123/0x1a0
> do_shrink_slab+0x10c/0x2c0
> shrink_slab+0x1f1/0x290
> drop_slab_node+0x4d/0x70
> soft_offline_page+0x1ac/0x5b0
> ? dev_mce_log+0xee/0x110
> ? notifier_call_chain+0x39/0x90
> memory_failure_work_func+0x6a/0x90
> process_one_work+0x19e/0x340
> ? process_one_work+0x340/0x340
> worker_thread+0x30/0x360
> ? process_one_work+0x340/0x340
> kthread+0x116/0x130
>
> The lockup made the machine is quite unusable. And it also made the
> most workingset gone, the reclaimabled slab caches were reduced from 12G
> to 300MB, the page caches were decreased from 17G to 4G.
>
> But the most disappointing thing is all the effort doesn't make the page
> offline, it just returns:
>
> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
>
> It seems the aggressive behavior for non-LRU page didn't pay back, so it
> doesn't make too much sense to keep it considering the terrible side
> effect.
>
> Reported-by: David Mackey <tdmackey@twitter.com>
> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Thanks,
David / dhildenb
prev parent reply other threads:[~2021-08-20 7:08 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-19 5:41 [v2 PATCH 1/3] mm: hwpoison: don't drop slab caches for offlining non-LRU page Yang Shi
2021-08-19 5:41 ` [v2 PATCH 2/3] doc: hwpoison: correct the support for hugepage Yang Shi
2021-08-20 9:34 ` David Hildenbrand
2021-08-19 5:41 ` [v2 PATCH 3/3] mm: hwpoison: dump page for unhandlable page Yang Shi
2021-08-20 6:48 ` HORIGUCHI NAOYA(堀口 直也)
2021-08-20 18:40 ` Yang Shi
2021-08-23 5:05 ` HORIGUCHI NAOYA(堀口 直也)
2021-08-23 17:47 ` Yang Shi
2021-08-23 23:17 ` HORIGUCHI NAOYA(堀口 直也)
2021-09-22 19:37 ` Luck, Tony
2021-09-22 19:58 ` Yang Shi
2021-09-22 20:37 ` Yang Shi
2021-08-20 7:04 ` [v2 PATCH 1/3] mm: hwpoison: don't drop slab caches for offlining non-LRU page HORIGUCHI NAOYA(堀口 直也)
2021-08-20 7:08 ` David Hildenbrand [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=12bf652b-7042-25d7-4910-276a6e422111@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=corbet@lwn.net \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=naoya.horiguchi@nec.com \
--cc=osalvador@suse.de \
--cc=shy828301@gmail.com \
--cc=tdmackey@twitter.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).