From: Ryan Roberts <ryan.roberts@arm.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Yang Shi <shy828301@gmail.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
Date: Sun, 10 Mar 2024 08:23:12 +0000	[thread overview]
Message-ID: <59098a73-636a-497d-bc20-0abc90b4868c@arm.com> (raw)
In-Reply-To: <Ze01sd5T6tLm6Ep8@casper.infradead.org>

On 10/03/2024 04:23, Matthew Wilcox wrote:
> On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
>>> I think split_queue_len is getting out of sync with the number of items on the
>>> queue? We only decrement it if we lost the race with folio_put(). But we are
>>> unconditionally taking folios off the list here. So we are definitely out of
>>> sync until we take the lock again below. But we only put folios back on the list
>>> that failed to split. A successful split used to decrement this variable
>>> (because the folio was on _a_ list). But now it doesn't. So we are always
>>> mismatched after the first failed split?
>>
>> Oops, I meant first *successful* split.
> 
> Agreed, nice fix.
> 
>> I've run the full test 5 times, and haven't seen any slow down or RCU stall
>> warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
>> did not trigger):
>>
>> [  944.475632] BUG: Bad page state in process usemem  pfn:252932
>> [  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0 mapping:000000003a777cd9 index:0x1 pfn:0x252932
>> [  944.478575] aops:0x0 ino:dead000000000122
>> [  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
>> [  944.479934] page_type: 0xffffffff()
>> [  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90 fffffc00084a4c90
>> [  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
>> [  944.482475] page dumped because: non-NULL mapping
> 
>> So what do we know?
>>
>>  - the above page looks like it was the 3rd page of a large folio
>>     - words 3 and 4 are the same, meaning they are likely empty _deferred_list
>>     - pfn alignment is correct for this
>>  - The _deferred_list for all previously freed large folios was empty
>>     - but the folio could have been in the new deferred split batch?
> 
> I don't think it could be in a deferred split batch because we hold the
> refcount at that point ...
> 
>>  - free_tail_page_prepare() zeroed mapping/_deferred_list during free
>>  - _deferred_list was subsequently reinitialized to "empty" while on free list
>>
>> So how about this for a rough hypothesis:
>>
>>
>> CPU1                                   CPU2
>> deferred_split_scan
>> list_del_init
>> folio_batch_add
>>                                        folio_put -> free
>>                                        free_tail_page_prepare
>>                                          is on deferred list? -> no
>> split_huge_page_to_list_to_order
>>   list_empty(folio->_deferred_list)
>>     -> yes
>>   list_del_init
>>                                        mapping = NULL
>>                                            -> (_deferred_list.prev = NULL)
>>                                        put page on free list
>>     INIT_LIST_HEAD(entry);
>>       -> "mapping" no longer NULL
>>
>>
>> But CPU1 is holding a reference, so that could only happen if a reference was
>> put one too many times. Ugh.
> 
> Before we start blaming the CPU for doing something impossible, 

It doesn't sound completely impossible to me that there is a rare error path that accidentally folio_put()s an extra time...

> what if
> we're taking the wrong lock?  

...but yeah, equally plausible, I guess.
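
As an aside on the "non-NULL mapping" step in the hypothesis quoted above: _deferred_list sits in the folio's third page, where its two words overlay what the bad-page check reads back as that tail page's lru.prev and mapping. A simplified model of the overlay (illustrative only, not the kernel's actual struct folio definition):

        /*
         * Model of the folio's third page: NOT the real struct folio,
         * just the two words that matter.  struct list_head is the
         * kernel's usual { next, prev } pair from <linux/types.h>.
         */
        struct third_tail_page_model {
                unsigned long flags;                     /* word 0 */
                unsigned long compound_head;             /* word 1 */
                union {
                        struct {                         /* struct page view */
                                void *lru_prev;          /* word 2 */
                                void *mapping;           /* word 3 */
                        };
                        struct list_head _deferred_list; /* folio view: words 2+3 */
                };
        };

So free_tail_page_prepare() clearing mapping clears _deferred_list.prev, and a later INIT_LIST_HEAD() writes a self-pointer back into that word, which the bad-page check then reports as "non-NULL mapping". It also matches the 3rd and 4th raw words of the dump being equal: an empty list has next == prev == &_deferred_list.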

> I know that seems crazy, but if page->flags
> gets corrupted to the point where we change some of the bits in the
> nid, when we free the folio, we call folio_undo_large_rmappable(),
> get the wrong ds_queue back from get_deferred_split_queue(), take the
> wrong split_queue_lock, corrupt the deferred list of a different node,
> and bad things happen?
> 
> I don't think we can detect that folio->nid has become corrupted in the
> page allocation/freeing code (can we?), but we can tell if a folio is
> on the wrong ds_queue in deferred_split_scan():
> 
>         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>                                                         _deferred_list) {
> +		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
> +		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
>                 list_del_init(&folio->_deferred_list);
> 
> (also testing the hypothesis that somehow a split folio has ended up
> on the deferred split list)

OK, ran with these checks, and got the following oops:

[  411.719461] page:0000000059c1826b refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x8c6a40
[  411.720807] page:0000000059c1826b refcount:0 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0x8c6a40
[  411.721792] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[  411.722453] page_type: 0xffffff7f(buddy)
[  411.722870] raw: 0bfffc0000000000 fffffc001227e808 fffffc002a857408 0000000000000000
[  411.723672] raw: 0000000000000001 0000000000000004 00000000ffffff7f 0000000000000000
[  411.724470] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
[  411.725176] ------------[ cut here ]------------
[  411.725642] kernel BUG at include/linux/mm.h:1191!
[  411.726341] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[  411.727021] Modules linked in:
[  411.727329] CPU: 40 PID: 2704 Comm: usemem Not tainted 6.8.0-rc5-00391-g44b0dc848590-dirty #45
[  411.728179] Hardware name: linux,dummy-virt (DT)
[  411.728657] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  411.729381] pc : __dump_page+0x450/0x4a8
[  411.729789] lr : __dump_page+0x450/0x4a8
[  411.730187] sp : ffff80008b97b6f0
[  411.730525] x29: ffff80008b97b6f0 x28: 00000000000000e2 x27: ffff80008b97b988
[  411.731227] x26: ffff80008b97b988 x25: ffff800082105000 x24: 0000000000000001
[  411.731926] x23: 0000000000000000 x22: 0000000000000001 x21: fffffc00221a9000
[  411.732630] x20: fffffc00221a9000 x19: fffffc00221a9000 x18: ffffffffffffffff
[  411.733331] x17: 3030303030303030 x16: 2066376666666666 x15: 076c076f07660721
[  411.734035] x14: 0728074f0749074c x13: 076c076f07660721 x12: 0000000000000000
[  411.734757] x11: 0720072007200729 x10: ffff0013f5e756c0 x9 : ffff80008014b604
[  411.735473] x8 : 00000000ffffbfff x7 : ffff0013f5e756c0 x6 : 0000000000000000
[  411.736198] x5 : ffff0013a5a24d88 x4 : 0000000000000000 x3 : 0000000000000000
[  411.736923] x2 : 0000000000000000 x1 : ffff0000c2849b80 x0 : 000000000000003e
[  411.737621] Call trace:
[  411.737870]  __dump_page+0x450/0x4a8
[  411.738229]  dump_page+0x2c/0x70
[  411.738551]  deferred_split_scan+0x258/0x368
[  411.738973]  do_shrink_slab+0x184/0x750
[  411.739355]  shrink_slab+0x4d4/0x9c0
[  411.739729]  shrink_node+0x214/0x860
[  411.740098]  do_try_to_free_pages+0xd0/0x560
[  411.740540]  try_to_free_mem_cgroup_pages+0x14c/0x330
[  411.741048]  try_charge_memcg+0x1cc/0x788
[  411.741456]  __mem_cgroup_charge+0x6c/0xd0
[  411.741884]  __handle_mm_fault+0x1000/0x1a28
[  411.742306]  handle_mm_fault+0x7c/0x418
[  411.742698]  do_page_fault+0x100/0x690
[  411.743080]  do_translation_fault+0xb4/0xd0
[  411.743508]  do_mem_abort+0x4c/0xa8
[  411.743876]  el0_da+0x54/0xb8
[  411.744177]  el0t_64_sync_handler+0xe4/0x158
[  411.744602]  el0t_64_sync+0x190/0x198
[  411.744976] Code: f000de00 912c4021 9126a000 97f79727 (d4210000) 
[  411.745573] ---[ end trace 0000000000000000 ]---

The new VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio) check is firing, but then when dump_page() does this:

	if (compound) {
		pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
				head, compound_order(head),
				folio_entire_mapcount(folio),
				folio_nr_pages_mapped(folio),
				atomic_read(&folio->_pincount));
	}

The VM_BUG_ON_FOLIO(!folio_test_large(folio), folio) inside folio_entire_mapcount() fires, so we get a nested oops.
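
(For reference, folio_entire_mapcount() is roughly the following; reconstructed from memory rather than pasted from the tree, so treat it as a sketch:

        /* include/linux/mm.h, approximately */
        static inline int folio_entire_mapcount(struct folio *folio)
        {
                /* This is the check that fires as the nested BUG. */
                VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
                return atomic_read(&folio->_entire_mapcount);
        }

so once the folio stops being large under us, any of the folio_* helpers that dump_page() calls can BUG before the original dump finishes.)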

So the very first dump line is from the first oops and the rest is from the second. I guess we are racing with the page being freed? The change in mapcount is interesting: 0 -> -128. Not sure why that would happen, unless it is just the buddy page_type (0xffffff7f) showing through the shared _mapcount word once the page reached the free lists?

Given the NID check didn't fire, I wonder if this points more towards an extra folio_put() than a corrupted folio nid?
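
If it trips again, something like the below (untested sketch, not a proposed patch) in place of the two VM_BUG_ON_FOLIO() checks would report the suspicious folio without going through dump_page(), so a racing free can't turn it into a nested oops:

        /* Untested sketch: report the folio without dump_page(), so a
         * concurrent free can't trip the VM_BUG_ON()s inside the folio
         * helpers that dump_page() uses. */
        if (folio_nid(folio) != sc->nid || folio_order(folio) < 2)
                pr_warn("deferred list: bad folio %px pfn %#lx nid %d order %u ref %d list %px/%px\n",
                        folio, folio_pfn(folio), folio_nid(folio),
                        folio_order(folio), folio_ref_count(folio),
                        folio->_deferred_list.next,
                        folio->_deferred_list.prev);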



Thread overview: 73+ messages
2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 01/18] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 02/18] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 03/18] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 04/18] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 05/18] memcg: Add mem_cgroup_uncharge_folios() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 06/18] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 07/18] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 08/18] mm: use __page_cache_release() in folios_put() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 09/18] mm: Handle large folios in free_unref_folios() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
2024-03-06 13:42   ` Ryan Roberts
2024-03-06 16:09     ` Matthew Wilcox
2024-03-06 16:19       ` Ryan Roberts
2024-03-06 17:41         ` Ryan Roberts
2024-03-06 18:41           ` Zi Yan
2024-03-06 19:55             ` Matthew Wilcox
2024-03-06 21:55               ` Matthew Wilcox
2024-03-07  8:56                 ` Ryan Roberts
2024-03-07 13:50                   ` Yin, Fengwei
2024-03-07 14:05                     ` Re: Matthew Wilcox
2024-03-07 15:24                       ` Re: Ryan Roberts
2024-03-07 16:24                         ` Re: Ryan Roberts
2024-03-07 23:02                           ` Re: Matthew Wilcox
2024-03-08  1:06                       ` Re: Yin, Fengwei
2024-03-07 17:33                   ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox
2024-03-07 18:35                     ` Ryan Roberts
2024-03-07 20:42                       ` Matthew Wilcox
2024-03-08 11:44                     ` Ryan Roberts
2024-03-08 12:09                       ` Ryan Roberts
2024-03-08 14:21                         ` Ryan Roberts
2024-03-08 15:11                           ` Matthew Wilcox
2024-03-08 16:03                             ` Matthew Wilcox
2024-03-08 17:13                               ` Ryan Roberts
2024-03-08 18:09                                 ` Ryan Roberts
2024-03-08 18:18                                   ` Matthew Wilcox
2024-03-09  4:34                                     ` Andrew Morton
2024-03-09  4:52                                       ` Matthew Wilcox
2024-03-09  8:05                                         ` Ryan Roberts
2024-03-09 12:33                                           ` Ryan Roberts
2024-03-10 13:38                                             ` Matthew Wilcox
2024-03-08 15:33                         ` Matthew Wilcox
2024-03-09  6:09                       ` Matthew Wilcox
2024-03-09  7:59                         ` Ryan Roberts
2024-03-09  8:18                           ` Ryan Roberts
2024-03-09  9:38                             ` Ryan Roberts
2024-03-10  4:23                               ` Matthew Wilcox
2024-03-10  8:23                                 ` Ryan Roberts [this message]
2024-03-10 11:08                                   ` Matthew Wilcox
2024-03-10 11:01       ` Ryan Roberts
2024-03-10 11:11         ` Matthew Wilcox
2024-03-10 16:31           ` Ryan Roberts
2024-03-10 19:57             ` Matthew Wilcox
2024-03-10 19:59             ` Ryan Roberts
2024-03-10 20:46               ` Matthew Wilcox
2024-03-10 21:52                 ` Matthew Wilcox
2024-03-11  9:01                   ` Ryan Roberts
2024-03-11 12:26                     ` Matthew Wilcox
2024-03-11 12:36                       ` Ryan Roberts
2024-03-11 15:50                         ` Matthew Wilcox
2024-03-11 16:14                           ` Ryan Roberts
2024-03-11 17:49                             ` Matthew Wilcox
2024-03-12 11:57                               ` Ryan Roberts
2024-03-11 19:26                             ` Matthew Wilcox
2024-03-10 11:14         ` Ryan Roberts
2024-02-27 17:42 ` [PATCH v3 11/18] mm: Free folios in a batch in shrink_folio_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 12/18] mm: Free folios directly in move_folios_to_lru() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 13/18] memcg: Remove mem_cgroup_uncharge_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 14/18] mm: Remove free_unref_page_list() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 15/18] mm: Remove lru_to_page() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 16/18] mm: Convert free_pages_and_swap_cache() to use folios_put() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 17/18] mm: Use a folio in __collapse_huge_page_copy_succeeded() Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 18/18] mm: Convert free_swap_cache() to take a folio Matthew Wilcox (Oracle)
