Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
From: Ryan Roberts
To: Matthew Wilcox
Cc: Zi Yan, Andrew Morton, linux-mm@kvack.org, Yang Shi, Huang Ying
Date: Fri, 8 Mar 2024 14:21:30 +0000
Message-ID: <644c2f60-dbb0-4fdb-8505-96f8101b2399@arm.com>
References: <20240227174254.710559-1-willy@infradead.org>
 <20240227174254.710559-11-willy@infradead.org>
 <367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com>
 <85cc26ed-6386-4d6b-b680-1e5fba07843f@arm.com>
 <36bdda72-2731-440e-ad15-39b845401f50@arm.com>
 <03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com>
 <7415b36c-b5d3-4655-92e1-b303104bf4a9@arm.com>
In-Reply-To: <7415b36c-b5d3-4655-92e1-b303104bf4a9@arm.com>

On 08/03/2024 12:09, Ryan Roberts wrote:
> On 08/03/2024 11:44, Ryan Roberts wrote:
>>> The thought occurs that we don't need to take the folios off the list.
>>> I don't know that will fix anything, but this will fix your "running out
>>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>>> failed. Of course, I can't call folio_put() inside the lock, so may
>>> as well move the trylock back to the second loop.
>>>
>>> Again, compile-tested only.
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index fd745bcc97ff..4a2ab17f802d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>>  	unsigned long flags;
>>> -	LIST_HEAD(list);
>>> +	struct folio_batch batch;
>>>  	struct folio *folio, *next;
>>>  	int split = 0;
>>>
>>> @@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  		ds_queue = &sc->memcg->deferred_split_queue;
>>>  #endif
>>>
>>> +	folio_batch_init(&batch);
>>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> -	/* Take pin on all head pages to avoid freeing them under us */
>>> +	/* Take ref on all folios to avoid freeing them under us */
>>>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>>  							_deferred_list) {
>>> -		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> -			/* We lost race with folio_put() */
>>> -			list_del_init(&folio->_deferred_list);
>>> -			ds_queue->split_queue_len--;
>>> +		if (!folio_try_get(folio))
>>> +			continue;
>>> +		if (folio_batch_add(&batch, folio) == 0) {
>>> +			--sc->nr_to_scan;
>>> +			break;
>>>  		}
>>>  		if (!--sc->nr_to_scan)
>>>  			break;
>>>  	}
>>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>
>>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>>  		if (!folio_trylock(folio))
>>> -			goto next;
>>> -		/* split_huge_page() removes page from list on success */
>>> +			continue;
>>>  		if (!split_folio(folio))
>>>  			split++;
>>>  		folio_unlock(folio);
>>> -next:
>>> -		folio_put(folio);
>>>  	}
>>>
>>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> -	list_splice_tail(&list, &ds_queue->split_queue);
>>> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>> +	folios_put(&batch);
>>>
>>>  	/*
>>>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>>
>>
>> OK I've tested this; the good news is that I haven't seen any oopses or memory
>> leaks. The bad news is that it still takes an absolute age (hours) to complete
>> the same test that, without "mm: Allow non-hugetlb large folios to be batch
>> processed", took a couple of minutes. And during that time, the system is
>> completely unresponsive - the serial terminal doesn't work - I can't even break
>> in with sysrq. And sometimes I see RCU stall warnings.
>>
>> Dumping all the CPU back traces with gdb, all the cores (except one) are
>> contending on the deferred split lock.
>>
>> A couple of thoughts:
>>
>> - Since we are now taking a maximum of 15 folios into a batch,
>> deferred_split_scan() is called much more often (in a tight loop from
>> do_shrink_slab()). Could it be that we are just trying to take the lock so much
>> more often now? I don't think it's quite that simple, because we take the lock
>> for every single folio when adding it to the queue, so the dequeuing cost should
>> still be a factor of 15 locks less.
>>
>> - do_shrink_slab() might be calling deferred_split_scan() in a tight loop with
>> deferred_split_scan() returning 0 most of the time. If there are still folios on
>> the deferred split list but deferred_split_scan() was unable to lock any folios
>> then it will return 0, not SHRINK_STOP, so do_shrink_slab() will keep calling
>> it, essentially live locking. (There's a rough sketch of that do_shrink_slab()
>> loop at the bottom of this mail.)
>> Has your patch changed the duration of the folio being locked? I don't think
>> so...
>>
>> - Ahh, perhaps it's as simple as your fix having removed the code that removed
>> the folio from the deferred split queue if it fails to get a reference? That
>> could mean we end up returning 0 instead of SHRINK_STOP too. I'll have a play.
>>
>
> I tested the last idea by adding this back in:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d46897d7ea7f..50b07362923a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3327,8 +3327,12 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	/* Take ref on all folios to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
> -		if (!folio_try_get(folio))
> +		if (!folio_try_get(folio)) {
> +			/* We lost race with folio_put() */
> +			list_del_init(&folio->_deferred_list);
> +			ds_queue->split_queue_len--;
>  			continue;
> +		}
>  		if (folio_batch_add(&batch, folio) == 0) {
>  			--sc->nr_to_scan;
>  			break;
>
> The test now gets further than where it was previously getting live-locked, but
> I then get a new oops (this is just yesterday's mm-unstable with your fix v2 and
> the above change):
>
> [  247.788985] BUG: Bad page state in process usemem  pfn:ae58c2
> [  247.789617] page: refcount:0 mapcount:0 mapping:00000000dc16b680 index:0x1 pfn:0xae58c2
> [  247.790129] aops:0x0 ino:dead000000000122
> [  247.790394] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  247.790821] page_type: 0xffffffff()
> [  247.791052] raw: 0bfffc0000000000 0000000000000000 fffffc002a963090 fffffc002a963090
> [  247.791546] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> [  247.792258] page dumped because: non-NULL mapping
> [  247.792567] Modules linked in:
> [  247.792772] CPU: 0 PID: 2052 Comm: usemem Not tainted 6.8.0-rc5-00456-g52fd6cd3bee5 #30
> [  247.793300] Hardware name: linux,dummy-virt (DT)
> [  247.793680] Call trace:
> [  247.793894]  dump_backtrace+0x9c/0x100
> [  247.794200]  show_stack+0x20/0x38
> [  247.794460]  dump_stack_lvl+0x90/0xb0
> [  247.794726]  dump_stack+0x18/0x28
> [  247.794964]  bad_page+0x88/0x128
> [  247.795196]  get_page_from_freelist+0xdc4/0x1280
> [  247.795520]  __alloc_pages+0xe8/0x1038
> [  247.795781]  alloc_pages_mpol+0x90/0x278
> [  247.796059]  vma_alloc_folio+0x70/0xd0
> [  247.796320]  __handle_mm_fault+0xc40/0x19a0
> [  247.796610]  handle_mm_fault+0x7c/0x418
> [  247.796908]  do_page_fault+0x100/0x690
> [  247.797231]  do_translation_fault+0xb4/0xd0
> [  247.797584]  do_mem_abort+0x4c/0xa8
> [  247.797874]  el0_da+0x54/0xb8
> [  247.798123]  el0t_64_sync_handler+0xe4/0x158
> [  247.798473]  el0t_64_sync+0x190/0x198
> [  247.815597] Disabling lock debugging due to kernel taint
>
> And then into RCU stalls after that. I have seen a similar non-NULL mapping oops
> yesterday. But with the deferred split fix in place, I can now see this reliably.
>
> My sense is that the first deferred split issue is now fully resolved once the
> extra code above is reinserted, but we still have a second problem. Thoughts?
>
> Perhaps I can bisect this given it seems pretty reproducible.

OK, a few more bits of information: bisect lands back on the same patch it always
does: "mm: Allow non-hugetlb large folios to be batch processed". Without this
change, I can't reproduce the above oops. With that change present, if I
"re-narrow" the window as you suggested, I also can't reproduce the problem.
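(For reference, my understanding is that the "page dumped because: non-NULL
mapping" report comes from the page sanity checks that mm/page_alloc.c runs both
when a page is freed and when it is handed out again via get_page_from_freelist().
Roughly, paraphrasing from memory rather than quoting the exact source, the
reason string is chosen like this:)

/*
 * Rough paraphrase (not verbatim) of the page sanity check in mm/page_alloc.c
 * that leads to bad_page(). The same reason-picking logic is used on both the
 * free and the allocate paths.
 */
static const char *page_bad_reason(struct page *page, unsigned long flags)
{
	const char *bad_reason = NULL;

	if (unlikely(atomic_read(&page->_mapcount) != -1))
		bad_reason = "nonzero mapcount";
	if (unlikely(page->mapping != NULL))
		bad_reason = "non-NULL mapping";	/* <- the one firing here */
	if (unlikely(page_ref_count(page) != 0))
		bad_reason = "nonzero _refcount";
	if (unlikely(page->flags & flags))
		bad_reason = "PAGE_FLAGS_CHECK_AT_* flag(s) set";

	return bad_reason;
}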
As far as I can tell, mapping is zeroed when the page is freed, and the same page
checks are run at that point too. So mapping must be written to while the page is
in the buddy? Perhaps something thinks it's still a tail page during split, but
the buddy thinks it's been freed? Also the mapping value 00000000dc16b680 is not
a valid kernel address, I don't think. So I'm surprised that
get_kernel_nofault(host, &mapping->host) works.

>
> Thanks,
> Ryan
>
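P.S. Here is the rough sketch of the do_shrink_slab() scan loop I referred to
above, for the live-lock theory. This is a simplified paraphrase of mm/shrinker.c
from memory, not the exact code; variable names are approximate. The point is
that a 0 return from ->scan_objects() just means "nothing freed this round" and
the loop carries on while the count of freeable objects still looks large; only
SHRINK_STOP makes it bail out early:

	/*
	 * Simplified sketch of the do_shrink_slab() scan loop (paraphrased).
	 * deferred_split_scan() is called here via ->scan_objects().
	 */
	while (total_scan >= batch_size ||
	       total_scan >= freeable) {
		unsigned long ret;
		unsigned long nr_to_scan = min(batch_size, total_scan);

		shrinkctl->nr_to_scan = nr_to_scan;
		shrinkctl->nr_scanned = nr_to_scan;
		ret = shrinker->scan_objects(shrinker, shrinkctl);
		if (ret == SHRINK_STOP)		/* only this stops the loop early */
			break;
		freed += ret;			/* ret == 0 just means "keep going" */

		total_scan -= shrinkctl->nr_scanned;
		scanned += shrinkctl->nr_scanned;

		cond_resched();
	}

So if deferred_split_scan() can neither split nor dequeue anything, but keeps
returning 0 rather than SHRINK_STOP while the queue still reports lots of
freeable folios, that would keep us spinning in this loop.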