Re: [RFC 1/2] mm: disable LRU pagevec during the migration temporarily

From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	cgoldswo@codeaurora.org, linux-fsdevel@vger.kernel.org,
	willy@infradead.org, david@redhat.com, vbabka@suse.cz,
	viro@zeniv.linux.org.uk, joaodias@google.com
Subject: Re: [RFC 1/2] mm: disable LRU pagevec during the migration temporarily
Date: Wed, 17 Feb 2021 12:51:25 -0800	[thread overview]
Message-ID: <YC2BzdHxF0xEdNxH@google.com> (raw)
In-Reply-To: <YCzm/3GIy1EJlBi2@dhcp22.suse.cz>

On Wed, Feb 17, 2021 at 10:50:55AM +0100, Michal Hocko wrote:
> On Wed 17-02-21 09:59:54, Michal Hocko wrote:
> > On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> [...]
> > >  /*
> > >   * migrate_prep() needs to be called before we start compiling a list of pages
> > >   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > > @@ -64,11 +80,27 @@
> > >   */
> > >  void migrate_prep(void)
> > >  {
> > > +	unsigned int cpu;
> > > +
> > > +	spin_lock(&migrate_pending_lock);
> > > +	migrate_pending_count++;
> > > +	spin_unlock(&migrate_pending_lock);
> > 
> > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > this really something that we have to microoptimize for? atomic_read is
> > a simple READ_ONCE on many archs.
> 
> Or you rather wanted to prevent from read memory barrier to enfore the
> ordering.

Yub.

> 
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
> > > +
> > > +		INIT_WORK(work, read_migrate_pending);
> > > +		queue_work_on(cpu, mm_percpu_wq, work);
> > > +	}
> > > +
> > > +	for_each_online_cpu(cpu)
> > > +		flush_work(&per_cpu(migrate_pending_work, cpu));
> > 
> > I also do not follow this scheme. Where is the IPI you are mentioning
> > above?
> 
> Thinking about it some more I think you mean the rescheduling IPI here?

True.

> 
> > > +	/*
> > > +	 * From now on, every online cpu will see uptodate
> > > +	 * migarte_pending_work.
> > > +	 */
> > >  	/*
> > >  	 * Clear the LRU lists so pages can be isolated.
> > > -	 * Note that pages may be moved off the LRU after we have
> > > -	 * drained them. Those pages will fail to migrate like other
> > > -	 * pages that may be busy.
> > >  	 */
> > >  	lru_add_drain_all();
> > 
> > Overall, this looks rather heavy weight to my taste. Have you tried to
> > play with a simple atomic counter approach? atomic_read when adding to
> > the cache and atomic_inc inside migrate_prep followed by lrdu_add_drain.

I'd like to avoid atomic operation if we could.

> 
> If you really want a strong ordering then it should be sufficient to
> simply alter lru_add_drain_all to force draining all CPUs. This will
> make sure no new pages are added to the pcp lists and you will also sync
> up anything that has accumulated because of a race between atomic_read
> and inc:
> diff --git a/mm/swap.c b/mm/swap.c
> index 2cca7141470c..91600d7bb7a8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -745,7 +745,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   * Calling this function with cpu hotplug locks held can actually lead
>   * to obscure indirect dependencies via WQ context.
>   */
> -void lru_add_drain_all(void)
> +void lru_add_drain_all(bool force_all_cpus)
>  {
>  	/*
>  	 * lru_drain_gen - Global pages generation number
> @@ -820,7 +820,8 @@ void lru_add_drain_all(void)
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  
> -		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> +		if (force_all_cpus ||
> +		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||

Yub, that's a idea.
How about this?

diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8aaeccd..2531642dd9ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,14 @@
 
 #include "internal.h"
 
+static DEFINE_SPINLOCK(migrate_pending_lock);
+static unsigned long migrate_pending_count;
+
+bool migrate_pending(void)
+{
+	return migrate_pending_count;
+}
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -64,13 +72,20 @@
  */
 void migrate_prep(void)
 {
+	unsigned int cpu;
+
+	/*
+	 * lru_add_drain_all's IPI will make sure no new pages are added
+	 * to the pcp lists and drain them all.
+	 */
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count++;
+	spin_unlock(&migrate_pending_lock);
+
 	/*
 	 * Clear the LRU lists so pages can be isolated.
-	 * Note that pages may be moved off the LRU after we have
-	 * drained them. Those pages will fail to migrate like other
-	 * pages that may be busy.
 	 */
-	lru_add_drain_all();
+	lru_add_drain_all(true);
 }
 
 /* Do the necessary work of migrate_prep but not if it involves other CPUs */
@@ -79,6 +94,15 @@ void migrate_prep_local(void)
 	lru_add_drain();
 }
 
+void migrate_finish(void)
+{
+	int cpu;
+
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count--;
+	spin_unlock(&migrate_pending_lock);
+}

A odd here is there are no barrier for migrate_finish for
updating migrate_pending_count so other CPUs will see
stale value until they encounters the barrier by chance.
However, it wouldn't be a big deal, IMHO.

What do you think?