Date: Sun, 10 Mar 2024 04:23:13 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Ryan Roberts
Cc: Zi Yan, Andrew Morton, linux-mm@kvack.org, Yang Shi, Huang Ying
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
References: <03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com>
 <090c9d68-9296-4338-9afa-5369bb1db66c@arm.com>
 <95f91a7f-13dd-4212-97ce-cf0b4060828a@arm.com>
 <08c01e9d-beda-435c-93ac-f303a89379df@arm.com>
In-Reply-To: <08c01e9d-beda-435c-93ac-f303a89379df@arm.com>

On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
> > I think split_queue_len is getting out of sync with the number of items on the
> > queue?  We only decrement it if we lost the race with folio_put().  But we are
> > unconditionally taking folios off the list here.  So we are definitely out of
> > sync until we take the lock again below.
> > But we only put folios back on the list
> > that failed to split.  A successful split used to decrement this variable
> > (because the folio was on _a_ list).  But now it doesn't.  So we are always
> > mismatched after the first failed split?
>
> Oops, I meant first *successful* split.

Agreed, nice fix.

> I've run the full test 5 times, and haven't seen any slow down or RCU stall
> warning.  But on the 5th time, I saw the non-NULL mapping oops (your new check
> did not trigger):
>
> [  944.475632] BUG: Bad page state in process usemem  pfn:252932
> [  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
> mapping:000000003a777cd9 index:0x1 pfn:0x252932
> [  944.478575] aops:0x0 ino:dead000000000122
> [  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  944.479934] page_type: 0xffffffff()
> [  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
> fffffc00084a4c90
> [  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  944.482475] page dumped because: non-NULL mapping
>
> So what do we know?
>
> - the above page looks like it was the 3rd page of a large folio
> - words 3 and 4 are the same, meaning they are likely empty _deferred_list
> - pfn alignment is correct for this
> - The _deferred_list for all previously freed large folios was empty
> - but the folio could have been in the new deferred split batch?

I don't think it could be in a deferred split batch because we hold the
refcount at that point ...

> - free_tail_page_prepare() zeroed mapping/_deferred_list during free
> - _deferred_list was subsequently reinitialized to "empty" while on free list
>
> So how about this for a rough hypothesis:
>
> CPU1                                     CPU2
> deferred_split_scan
> list_del_init
> folio_batch_add
>                                          folio_put -> free
>                                          free_tail_page_prepare
>                                          is on deferred list?
>                                            -> no
> split_huge_page_to_list_to_order
>   list_empty(folio->_deferred_list)
>     -> yes
>   list_del_init
>                                          mapping = NULL
>                                            -> (_deferred_list.prev = NULL)
>                                          put page on free list
>   INIT_LIST_HEAD(entry);
>     -> "mapping" no longer NULL
>
> But CPU1 is holding a reference, so that could only happen if a reference was
> put one too many times.  Ugh.

Before we start blaming the CPU for doing something impossible, what if
we're taking the wrong lock?  I know that seems crazy, but if page->flags
gets corrupted to the point where we change some of the bits in the nid,
then when we free the folio, we call folio_undo_large_rmappable(), get
the wrong ds_queue back from get_deferred_split_queue(), take the wrong
split_queue_lock, corrupt the deferred list of a different node, and bad
things happen?

I don't think we can detect that folio->nid has become corrupted in the
page allocation/freeing code (can we?), but we can tell if a folio is on
the wrong ds_queue in deferred_split_scan():

	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
						_deferred_list) {
+		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
+		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
		list_del_init(&folio->_deferred_list);

(The second assertion also tests the hypothesis that somehow a split
folio has ended up on the deferred split list.)

This wouldn't catch the splat above early, I don't think, but it might
trigger early enough with your workload that it'd be useful information.

(I reviewed the patch you're currently testing with and it matches what
I think we should be doing.)