Message-ID: <08c01e9d-beda-435c-93ac-f303a89379df@arm.com>
Date: Sat, 9 Mar 2024 09:38:42 +0000
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
From: Ryan Roberts <ryan.roberts@arm.com>
To: Matthew Wilcox
Cc: Zi Yan, Andrew Morton, linux-mm@kvack.org, Yang Shi, Huang Ying
References: <367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com>
 <85cc26ed-6386-4d6b-b680-1e5fba07843f@arm.com>
 <36bdda72-2731-440e-ad15-39b845401f50@arm.com>
 <03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com>
 <090c9d68-9296-4338-9afa-5369bb1db66c@arm.com>
 <95f91a7f-13dd-4212-97ce-cf0b4060828a@arm.com>
In-Reply-To: <95f91a7f-13dd-4212-97ce-cf0b4060828a@arm.com>
Content-Type: text/plain; charset=UTF-8
[...]

>>> If so, we need to take the folios off the list (or otherwise mark them)
>>> so that they can't be processed by more than one CPU at a time. And
>>> that leads me to this patch (yes, folio_prep_large_rmappable() is
>>> now vestigial, but removing it increases the churn a bit much for this
>>> stage of debugging)
>>
>> Looks sensible on first review. I'll do some testing now to see if I can
>> re-trigger the non-NULL mapping issue. Will get back to you in the next
>> couple of hours.
>>

[...]
>>> -		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> +		list_del_init(&folio->_deferred_list);
>>> +		sc->nr_to_scan--;
>>> +		if (!folio_try_get(folio)) {
>>>  			/* We lost race with folio_put() */
>>> -			list_del_init(&folio->_deferred_list);
>>>  			ds_queue->split_queue_len--;
>
> I think split_queue_len is getting out of sync with the number of items on
> the queue? We only decrement it if we lost the race with folio_put(), but
> we are unconditionally taking folios off the list here. So we are
> definitely out of sync until we take the lock again below. But we only put
> back on the list those folios that failed to split; a successful split
> used to decrement this variable (because the folio was on _a_ list), but
> now it doesn't. So we are always mismatched after the first failed split?

Oops, I meant first *successful* split.

> I'll fix this up before I start testing.

I've run the full test 5 times, and haven't seen any slowdown or RCU stall
warning.
But on the 5th time, I saw the non-NULL mapping oops (your new check did not
trigger):

[  944.475632] BUG: Bad page state in process usemem  pfn:252932
[  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0 mapping:000000003a777cd9 index:0x1 pfn:0x252932
[  944.478575] aops:0x0 ino:dead000000000122
[  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[  944.479934] page_type: 0xffffffff()
[  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90 fffffc00084a4c90
[  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[  944.482475] page dumped because: non-NULL mapping
[  944.482938] Modules linked in:
[  944.483238] CPU: 9 PID: 3805 Comm: usemem Not tainted 6.8.0-rc5-00391-g44b0dc848590 #39
[  944.484308] Hardware name: linux,dummy-virt (DT)
[  944.484981] Call trace:
[  944.485315]  dump_backtrace+0x9c/0x100
[  944.485840]  show_stack+0x20/0x38
[  944.486300]  dump_stack_lvl+0x90/0xb0
[  944.486833]  dump_stack+0x18/0x28
[  944.487178]  bad_page+0x88/0x128
[  944.487523]  __rmqueue_pcplist+0x24c/0xa18
[  944.488001]  get_page_from_freelist+0x278/0x1288
[  944.488579]  __alloc_pages+0xec/0x1018
[  944.489097]  alloc_pages_mpol+0x11c/0x278
[  944.489455]  vma_alloc_folio+0x70/0xd0
[  944.489988]  do_huge_pmd_anonymous_page+0xb8/0xda8
[  944.490819]  __handle_mm_fault+0xd00/0x1a28
[  944.491497]  handle_mm_fault+0x7c/0x418
[  944.491978]  do_page_fault+0x100/0x690
[  944.492357]  do_translation_fault+0xb4/0xd0
[  944.492968]  do_mem_abort+0x4c/0xa8
[  944.493353]  el0_da+0x54/0xb8
[  944.493818]  el0t_64_sync_handler+0xe4/0x158
[  944.494581]  el0t_64_sync+0x190/0x198
[  944.495218] Disabling lock debugging due to kernel taint

So what do we know?
 - the above page looks like it was the 3rd page of a large folio
    - words 3 and 4 are the same, meaning they are likely an empty
      _deferred_list
    - pfn alignment is correct for this
 - The _deferred_list for all previously freed large folios was empty
    - but the folio could have been in the new deferred split batch?
 - free_tail_page_prepare() zeroed mapping/_deferred_list during free
 - _deferred_list was subsequently reinitialized to "empty" while on the
   free list

So how about this for a rough hypothesis:

CPU1                                     CPU2
deferred_split_scan()
list_del_init()
folio_batch_add()
                                         folio_put() -> free
                                         free_tail_page_prepare()
                                         is on deferred list? -> no
split_huge_page_to_list_to_order()
list_empty(folio->_deferred_list)
 -> yes
list_del_init()
                                         mapping = NULL
                                          -> (_deferred_list.prev = NULL)
                                         put page on free list
INIT_LIST_HEAD(entry);
 -> "mapping" no longer NULL

But CPU1 is holding a reference, so that could only happen if a reference
was put one too many times. Ugh.

FYI, this is the fixed-up fix patch I have:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..fde016451e1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
 	if (!folio || !folio_test_large(folio))
 		return;
-	if (folio_order(folio) > 1)
-		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
 }
 
@@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,36 +3319,39 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/* Take pin on all head pages to avoid freeing them under us */
+	/* Take ref on all folios to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 						_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
-		}
-		if (!--sc->nr_to_scan)
+		list_del_init(&folio->_deferred_list);
+		ds_queue->split_queue_len--;
+		sc->nr_to_scan--;
+		if (folio_try_get(folio) &&
+		    folio_batch_add(&batch, folio) == 0)
+			break;
+		if (!sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+			continue;
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+		ds_queue->split_queue_len++;
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	folios_put(&batch);
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
diff --git a/mm/internal.h b/mm/internal.h
index 1dfdc3bde1b0..14c21d06f233 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+	if (order > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 025ad1a7df7b..fc9c7ca24c4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
+		 * the second tail page: ->mapping is deferred_list.next
 		 */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "still on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {