Re: [External] Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

From: Usama Arif <usama.arif@bytedance.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, muchun.song@linux.dev,
	songmuchun@bytedance.com, fam.zheng@bytedance.com,
	liangma@liangbit.com, punit.agrawal@bytedance.com,
	Konrad Dybcio <konrad.dybcio@linaro.org>
Subject: Re: [External] Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages
Date: Tue, 10 Oct 2023 18:01:23 +0100	[thread overview]
Message-ID: <6b1d9860-3581-0b99-4fb7-4c1f5a2a05f3@bytedance.com> (raw)
In-Reply-To: <20231010012345.GA108129@monkey>



On 10/10/2023 02:23, Mike Kravetz wrote:
> On 10/09/23 15:56, Usama Arif wrote:
>> Calling prep_and_add_allocated_folios when allocating gigantic pages
>> at boot time causes the kernel to crash as folio_list is empty
>> and iterating it causes a NULL pointer dereference. Call this only
>> for non-gigantic pages when folio_list has entires.
> 
> Thanks!
> 
> However, are you sure the issue is the result of iterating through a
> NULL list?  For reference, the routine prep_and_add_allocated_folios is:
> 

Yes, you are right, it wasnt an issue with the list, but the lock. If I 
do the below diff it boots.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73803d62066a..f428af13e98a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2178,18 +2178,19 @@ static struct folio 
*alloc_fresh_hugetlb_folio(struct hstate *h,
  static void prep_and_add_allocated_folios(struct hstate *h,
                                         struct list_head *folio_list)
  {
+       unsigned long flags;
         struct folio *folio, *tmp_f;

         /* Send list for bulk vmemmap optimization processing */
         hugetlb_vmemmap_optimize_folios(h, folio_list);

         /* Add all new pool pages to free lists in one lock cycle */
-       spin_lock_irq(&hugetlb_lock);
+       spin_lock_irqsave(&hugetlb_lock, flags);
         list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
                 __prep_account_new_huge_page(h, folio_nid(folio));
                 enqueue_hugetlb_folio(h, folio);
         }
-       spin_unlock_irq(&hugetlb_lock);
+       spin_unlock_irqrestore(&hugetlb_lock, flags);
  }

  /*


FYI, this was an x86 VM with kvm enabled.

Thanks,
Usama

> static void prep_and_add_allocated_folios(struct hstate *h,
> 					struct list_head *folio_list)
> {
> 	struct folio *folio, *tmp_f;
> 
> 	/* Add all new pool pages to free lists in one lock cycle */
> 	spin_lock_irq(&hugetlb_lock);
> 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> 		__prep_account_new_huge_page(h, folio_nid(folio));
> 		enqueue_hugetlb_folio(h, folio);
> 	}
> 	spin_unlock_irq(&hugetlb_lock);
> }
> 
> If folio_list is empty, then the only code that should be executed is
> acquiring the lock, notice the list is empty, release the lock.
> 
> In the case of gigantic pages addressed below, I do see the warning:
> 
> [    0.055140] DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
> [    0.055149] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4345 lockdep_hardirqs_on_prepare+0x1a8/0x1b0
> [    0.055153] Modules linked in:
> [    0.055155] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc4+ #40
> [    0.055157] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [    0.055158] RIP: 0010:lockdep_hardirqs_on_prepare+0x1a8/0x1b0
> [    0.055160] Code: 00 85 c0 0f 84 5e ff ff ff 8b 0d a7 20 74 01 85 c9 0f 85 50 ff ff ff 48 c7 c6 48 25 42 82 48 c7 c7 70 7f 40 82 e8 18 10 f7 ff <0f> 0b 5b e9 e0 d8 af 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> [    0.055162] RSP: 0000:ffffffff82603d40 EFLAGS: 00010086 ORIG_RAX: 0000000000000000
> [    0.055164] RAX: 0000000000000000 RBX: ffffffff827911e0 RCX: 0000000000000000
> [    0.055165] RDX: 0000000000000004 RSI: ffffffff8246b3e1 RDI: 00000000ffffffff
> [    0.055166] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
> [    0.055166] R10: ffffffffffffffff R11: 284e4f5f4e524157 R12: 0000000000000001
> [    0.055167] R13: ffffffff82eb6316 R14: ffffffff82603d70 R15: ffffffff82ee5f70
> [    0.055169] FS:  0000000000000000(0000) GS:ffff888277c00000(0000) knlGS:0000000000000000
> [    0.055170] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.055171] CR2: ffff88847ffff000 CR3: 000000000263a000 CR4: 00000000000200b0
> [    0.055174] Call Trace:
> [    0.055174]  <TASK>
> [    0.055175]  ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
> [    0.055177]  ? __warn+0x81/0x170
> [    0.055181]  ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
> [    0.055182]  ? report_bug+0x18d/0x1c0
> [    0.055186]  ? early_fixup_exception+0x92/0xb0
> [    0.055189]  ? early_idt_handler_common+0x2f/0x40
> [    0.055194]  ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
> [    0.055196]  trace_hardirqs_on+0x10/0xa0
> [    0.055198]  _raw_spin_unlock_irq+0x24/0x50
> [    0.055201]  hugetlb_hstate_alloc_pages+0x311/0x3e0
> [    0.055206]  hugepages_setup+0x220/0x2c0
> [    0.055210]  unknown_bootoption+0x98/0x1d0
> [    0.055213]  parse_args+0x152/0x440
> [    0.055216]  ? __pfx_unknown_bootoption+0x10/0x10
> [    0.055220]  start_kernel+0x1af/0x6c0
> [    0.055222]  ? __pfx_unknown_bootoption+0x10/0x10
> [    0.055225]  x86_64_start_reservations+0x14/0x30
> [    0.055227]  x86_64_start_kernel+0x74/0x80
> [    0.055229]  secondary_startup_64_no_verify+0x166/0x16b
> [    0.055234]  </TASK>
> [    0.055235] irq event stamp: 0
> [    0.055236] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [    0.055238] hardirqs last disabled at (0): [<0000000000000000>] 0x0
> [    0.055239] softirqs last  enabled at (0): [<0000000000000000>] 0x0
> [    0.055240] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [    0.055240] ---[ end trace 0000000000000000 ]---
> 
> This is because interrupts are not enabled this early in boot, and the
> spin_unlock_irq() would incorrectly enable interrupts too early.  I wonder
> if this 'warning' could translate to a panic or NULL deref under certain
> configurations?
> 
> Konrad, I am interested to see if this addresses your booting problem.  But,
> your stack trace is a bit different.  My 'guess' is that this will not address
> your issue.  If it does not, can you try the following patch?  This
> applies to next-20231009.