Date: Sun, 10 Mar 2024 04:23:13 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Ryan Roberts
Cc: Zi Yan, Andrew Morton, linux-mm@kvack.org, Yang Shi, Huang Ying
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
References: <03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com>
 <090c9d68-9296-4338-9afa-5369bb1db66c@arm.com>
 <95f91a7f-13dd-4212-97ce-cf0b4060828a@arm.com>
 <08c01e9d-beda-435c-93ac-f303a89379df@arm.com>
In-Reply-To: <08c01e9d-beda-435c-93ac-f303a89379df@arm.com>

On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
> > I think split_queue_len is getting out of sync with the number of items on the
> > queue?  We only decrement it if we lost the race with folio_put().  But we are
> > unconditionally taking folios off the list here.  So we are definitely out of
> > sync until we take the lock again below.
> > But we only put folios back on the list
> > that failed to split.  A successful split used to decrement this variable
> > (because the folio was on _a_ list).  But now it doesn't.  So we are always
> > mismatched after the first failed split?
>
> Oops, I meant first *successful* split.

Agreed, nice fix.

> I've run the full test 5 times, and haven't seen any slow down or RCU stall
> warning.  But on the 5th time, I saw the non-NULL mapping oops (your new check
> did not trigger):
>
> [  944.475632] BUG: Bad page state in process usemem  pfn:252932
> [  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
> mapping:000000003a777cd9 index:0x1 pfn:0x252932
> [  944.478575] aops:0x0 ino:dead000000000122
> [  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  944.479934] page_type: 0xffffffff()
> [  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
> fffffc00084a4c90
> [  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  944.482475] page dumped because: non-NULL mapping
>
> So what do we know?
>
> - the above page looks like it was the 3rd page of a large folio
> - words 3 and 4 are the same, meaning they are likely empty _deferred_list
> - pfn alignment is correct for this
> - The _deferred_list for all previously freed large folios was empty
> - but the folio could have been in the new deferred split batch?

I don't think it could be in a deferred split batch because we hold the
refcount at that point ...

> - free_tail_page_prepare() zeroed mapping/_deferred_list during free
> - _deferred_list was subsequently reinitialized to "empty" while on free list
>
> So how about this for a rough hypothesis:
>
> CPU1                                     CPU2
> deferred_split_scan
> list_del_init
> folio_batch_add
>                                          folio_put -> free
>                                          free_tail_page_prepare
>                                          is on deferred list?
>                                            -> no
> split_huge_page_to_list_to_order
>   list_empty(folio->_deferred_list)
>     -> yes
>   list_del_init
>                                          mapping = NULL
>                                            -> (_deferred_list.prev = NULL)
>                                          put page on free list
>   INIT_LIST_HEAD(entry);
>     -> "mapping" no longer NULL
>
> But CPU1 is holding a reference, so that could only happen if a reference was
> put one too many times.  Ugh.

Before we start blaming the CPU for doing something impossible, what if
we're taking the wrong lock?  I know that seems crazy, but if page->flags
gets corrupted to the point where we change some of the bits in the nid,
then when we free the folio, we call folio_undo_large_rmappable(), get
the wrong ds_queue back from get_deferred_split_queue(), take the wrong
split_queue_lock, corrupt the deferred list of a different node, and bad
things happen?

I don't think we can detect that folio->nid has become corrupted in the
page allocation/freeing code (can we?), but we can tell if a folio is on
the wrong ds_queue in deferred_split_scan():

	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
						_deferred_list) {
+		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
+		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
		list_del_init(&folio->_deferred_list);

(The second assertion also tests the hypothesis that somehow a split
folio has ended up on the deferred split list.)

This wouldn't catch the splat above early, I don't think, but it might
trigger early enough with your workload that it'd be useful information.

(I reviewed the patch you're currently testing with and it matches what
I think we should be doing.)