From: Michal Hocko
To: Minchan Kim
Cc: Andrew Morton, David Hildenbrand, linux-mm, LKML, Suren Baghdasaryan, John Dias
Subject: Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Date: Tue, 25 Jan 2022 10:23:13 +0100

On Mon 24-01-22 14:22:03, Minchan Kim wrote:
[...]
>        CPU 0                                  CPU 1
>
>     lru_cache_disable                      lru_cache_disable
>       ret = atomic_inc_return;(ret = 1)
>
>                                              ret = atomic_inc_return;(ret = 2)
>
>       lru_add_drain_all(true);
>                                              lru_add_drain_all(false)
>                                                mutex_lock() is holding
>       mutex_lock() is waiting
>
>                                                IPI with !force_all_cpus
>                                                ...
>                                                ...
>                                                IPI done but it skipped some CPUs
>
>      ..
>      ..
>
> Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> introduces race of lru_disable_count so some pages on cores
> which didn't run the IPI could accept upcoming pages into per-cpu
> cache.

Yes, that is certainly possible but the question is whether it really
matters all that much. The race would also require another racer to be
adding a page to an _empty_ pcp list at the same time (a rough sketch of
the helper involved follows below):

pagevec_add_and_need_flush
  1) pagevec_add	# add to pcp list
  2) lru_cache_disabled
       atomic_read(lru_disable_count) = 0
  # no flush but the page is on pcp

There is no strong memory ordering between 1 and 2 and that is why we
need an IPI to enforce it in general IIRC.

But lru_cache_disable is not a strong synchronization primitive. It aims
at providing a best effort means to reduce false positives, right? IMHO
it doesn't make much sense to aim for perfection because all users of
this interface already have to live with temporary failures, and pcp
caches are not the only reason to fail - e.g. short lived page pins.

That being said, I would rather live with a best effort and simpler
implementation than aim for perfection in this case.
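For completeness, this is roughly how the two steps above look in
mm/swap.c of this era (a sketch from memory, not a verbatim quote of the
tree, so take the details with a grain of salt):

static inline bool lru_cache_disabled(void)
{
	return atomic_read(&lru_disable_count);
}

static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
{
	bool ret = false;

	/*
	 * 1) pagevec_add puts the page on the per-cpu pagevec first
	 *    (it returns the number of free slots left),
	 * 2) only then is lru_disable_count read, and nothing orders
	 *    the two steps against a concurrent lru_cache_disable()
	 *    short of the draining IPI/work.
	 */
	if (!pagevec_add(pvec, page) || PageCompound(page) ||
			lru_cache_disabled())
		ret = true;

	return ret;
}

So a page that slipped in between the two steps can only be noticed by
the drain itself, which is exactly why the force mode exists.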
The scheme is already quite complex and another lock in the mix doesn't
make it any easier to follow. If others believe that another lock makes
the implementation more straightforward I will not object but I would go
with the following.

diff --git a/mm/swap.c b/mm/swap.c
index ae8d56848602..c140c3743b9e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
  */
 void lru_cache_disable(void)
 {
-	atomic_inc(&lru_disable_count);
+	int count = atomic_inc_return(&lru_disable_count);
+
 #ifdef CONFIG_SMP
 	/*
 	 * lru_add_drain_all in the force mode will schedule draining on
@@ -931,8 +932,28 @@ void lru_cache_disable(void)
 	 * The atomic operation doesn't need to have stronger ordering
 	 * requirements because that is enforeced by the scheduling
 	 * guarantees.
+	 * Please note that there is a potential for a race condition:
+	 * CPU0                          CPU1                  CPU2
+	 * pagevec_add_and_need_flush
+	 *   pagevec_add # to the empty list
+	 *   lru_cache_disabled
+	 *     atomic_read # 0
+	 *                               lru_cache_disable     lru_cache_disable
+	 *                                 atomic_inc_return (1)
+	 *                                                       atomic_inc_return (2)
+	 *                                 __lru_add_drain_all(true)
+	 *                                                       __lru_add_drain_all(false)
+	 *                                                         mutex_lock
+	 *                                   mutex_lock
+	 *                                                         # skip cpu0 (pagevec_add not visible yet)
+	 *                                                         mutex_unlock
+	 *                                   # fail because of pcp(0) pin
+	 *                                   queue_work_on(0)
+	 *
+	 * but the scheme is a best effort and the above race quite unlikely
+	 * to matter in real life.
 	 */
-	__lru_add_drain_all(true);
+	__lru_add_drain_all(count == 1);
 #else
 	lru_add_and_bh_lrus_drain();
 #endif

-- 
Michal Hocko
SUSE Labs