Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable

From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	John Dias <joaodias@google.com>
Subject: Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Date: Tue, 25 Jan 2022 13:06:17 -0800	[thread overview]
Message-ID: <YfBmSaMa826ZhFT4@google.com> (raw)
In-Reply-To: <Ye/Bgc1bH979cXwy@dhcp22.suse.cz>

On Tue, Jan 25, 2022 at 10:23:13AM +0100, Michal Hocko wrote:
> On Mon 24-01-22 14:22:03, Minchan Kim wrote:
> [...]
> >  CPU 0                               CPU 1
> > 
> >  lru_cache_disable                  lru_cache_disable
> >    ret = atomic_inc_return;(ret = 1)
> >                                      
> >                                     ret = atomic_inc_return;(ret = 2)
> >                                     
> >    lru_add_drain_all(true);         
> >                                     lru_add_drain_all(false)
> >                                     mutex_lock() is holding
> >    mutex_lock() is waiting
> > 
> >                                     IPI with !force_all_cpus
> >                                     ...
> >                                     ...
> >                                     IPI done but it skipped some CPUs
> >                
> >      ..
> >      ..
> >  
> > 
> > Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> > introduces race of lru_disable_count so some pages on cores
> > which didn't run the IPI could accept upcoming pages into per-cpu
> > cache.
> 
> Yes, that is certainly possible but the question is whether it really
> matters all that much. The race would require also another racer to be
> adding a page to an _empty_ pcp list at the same time.
> 
> pagevec_add_and_need_flush
>   1) pagevec_add # add to pcp list
>   2) lru_cache_disabled
>     atomic_read(lru_disable_count) = 0
>   # no flush but the page is on pcp
> 
> There is no strong memory ordering between 1 and 2 and that is why we
> need an IPI to enforce it in general IIRC.

Correct.

> 
> But lru_cache_disable is not a strong synchronization primitive. It aims
> at providing a best effort means to reduce false positives, right? IMHO

Nope. d479960e44f27, mm: disable LRU pagevec during the migration temporarily

Originally, it was designed to close the race fundamentally.

> it doesn't make much sense to aim for perfection because all users of
> this interface already have to live with temporary failures and pcp
> caches is not the only reason to fail - e.g. short lived page pins.

short lived pages are true but that doesn't mean we need to make the
allocation faster. As I mentioned, the IPI takes up to hundreds
milliseconds easily once CPUs are fully booked. If we reduce the
cost, we could spend the time more productive works. I am working
on making CMA more determinstic and this patch is one of parts.

> 
> That being said, I would rather live with a best effort and simpler
> implementation approach rather than aim for perfection in this case.
> The scheme is already quite complex and another lock in the mix doesn't

lru_add_drain_all already hides the whole complexity inside and
lru_cache_disable adds A simple synchroniztion to keep ordering
on top of it. That's natural SW stack and I don't see too complication
here.

> make it any easier to follow. If others believe that another lock makes

Disagree. lru_cache_disable is designed to guarantee closing the race
you are opening again so the other code in allocator since disabling
per-cpu cache doesn't need to consider the race at all. It's more
simple/deterministic and we could make other stuff based on it(e.g.,
zone->pcp). 

> the implementation more straightforward I will not object but I would go
> with the following.
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index ae8d56848602..c140c3743b9e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	int count = atomic_inc_return(&lru_disable_count);
> +
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -931,8 +932,28 @@ void lru_cache_disable(void)
>  	 * The atomic operation doesn't need to have stronger ordering
>  	 * requirements because that is enforeced by the scheduling
>  	 * guarantees.
> +	 * Please note that there is a potential for a race condition:
> +	 * CPU0				CPU1			CPU2
> +	 * pagevec_add_and_need_flush
> +	 *   pagevec_add # to the empty list
> +	 *   lru_cache_disabled
> +	 *     atomic_read # 0
> +	 *   				lru_cache_disable	lru_cache_disable
> +	 *				  atomic_inc_return (1)
> +	 *				  			  atomic_inc_return (2)
> +	 *				  __lru_add_drain_all(true)
> +	 *				  			  __lru_add_drain_all(false)
> +	 *				  			    mutex_lock
> +	 *				    mutex_lock
> +	 *				    			    # skip cpu0 (pagevec_add not visible yet)
> +	 *				    			    mutex_unlock
> +	 *				    			 # fail because of pcp(0) pin
> +	 *				    queue_work_on(0)
> +	 *
> +	 * but the scheme is a best effort and the above race quite unlikely
> +	 * to matter in real life.
>  	 */
> -	__lru_add_drain_all(true);
> +	__lru_add_drain_all(count == 1);
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> -- 
> Michal Hocko
> SUSE Labs