Re: [PATCH RFC] mm: migrate: Fix races of __find_get_block() and page migration

From: Mel Gorman <mgorman@suse.de>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, mhocko@suse.cz, stable@vger.kernel.org
Subject: Re: [PATCH RFC] mm: migrate: Fix races of __find_get_block() and page migration
Date: Fri, 12 Jul 2019 13:39:35 +0100	[thread overview]
Message-ID: <20190712123935.GK13484@suse.de> (raw)
In-Reply-To: <20190712112056.GA24009@quack2.suse.cz>

On Fri, Jul 12, 2019 at 01:20:56PM +0200, Jan Kara wrote:
> On Fri 12-07-19 11:10:43, Mel Gorman wrote:
> > On Fri, Jul 12, 2019 at 11:17:46AM +0200, Jan Kara wrote:
> > > On Thu 11-07-19 17:04:55, Andrew Morton wrote:
> > > > On Thu, 11 Jul 2019 14:58:38 +0200 Jan Kara <jack@suse.cz> wrote:
> > > > 
> > > > > buffer_migrate_page_norefs() can race with bh users in a following way:
> > > > > 
> > > > > CPU1					CPU2
> > > > > buffer_migrate_page_norefs()
> > > > >   buffer_migrate_lock_buffers()
> > > > >   checks bh refs
> > > > >   spin_unlock(&mapping->private_lock)
> > > > > 					__find_get_block()
> > > > > 					  spin_lock(&mapping->private_lock)
> > > > > 					  grab bh ref
> > > > > 					  spin_unlock(&mapping->private_lock)
> > > > >   move page				  do bh work
> > > > > 
> > > > > This can result in various issues like lost updates to buffers (i.e.
> > > > > metadata corruption) or use after free issues for the old page.
> > > > > 
> > > > > Closing this race window is relatively difficult. We could hold
> > > > > mapping->private_lock in buffer_migrate_page_norefs() until we are
> > > > > finished with migrating the page but the lock hold times would be rather
> > > > > big. So let's revert to a more careful variant of page migration requiring
> > > > > eviction of buffers on migrated page. This is effectively
> > > > > fallback_migrate_page() that additionally invalidates bh LRUs in case
> > > > > try_to_free_buffers() failed.
> > > > 
> > > > Is this premature optimization?  Holding ->private_lock while messing
> > > > with the buffers would be the standard way of addressing this.  The
> > > > longer hold times *might* be an issue, but we don't know this, do we? 
> > > > If there are indeed such problems then they could be improved by, say,
> > > > doing more of the newpage preparation prior to taking ->private_lock.
> > > 
> > > I didn't check how long the private_lock hold times would actually be, it
> > > just seems there's a lot of work done before the page is fully migrated a
> > > we could release the lock. And since the lock blocks bh lookup,
> > > set_page_dirty(), etc. for the whole device, it just seemed as a bad idea.
> > > I don't think much of a newpage setup can be moved outside of private_lock
> > > - in particular page cache replacement, page copying, page state migration
> > > all need to be there so that bh code doesn't get confused.
> > > 
> > > But I guess it's fair to measure at least ballpark numbers of what the lock
> > > hold times would be to get idea whether the contention concern is
> > > substantiated or not.
> > > 
> > 
> > I think it would be tricky to measure and quantify how much the contention
> > is an issue. While it would be possible to construct a microbenchmark that
> > should illustrate the problem, it would tell us relatively little about
> > how much of a problem it is generally. It would be relatively difficult
> > to detect the contention and stalls in block lookups due to migration
> > would be tricky to spot. Careful use of lock_stat might help but
> > enabling that has consequences of its own.
> > 
> > However, a rise in allocation failures due to dirty pages not being
> > migrated is relatively easy to detect and the consequences are relatively
> > benign -- failed high-order allocation that is usually ok versus a stall
> > on block lookups that could have a wider general impact.
> > 
> > On that basis, I think the patch you proposed is the more appropriate as
> > a fix for the race which has the potential for data corruption. So;
> > 
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Thanks. And I agree with you that detecting failed migrations is generally
> easier than detecting private_lock contention. Anyway, out of curiosity, I
> did run thpfioscale workload in mmtests with some additional metadata
> workload on the system to increase proportion of bdev page cache and added
> tracepoints to see how long the relevant part of __buffer_migrate_page()
> lasts (patch attached). The longest duration of the critical section was
> 311 us which is significant. But that was an outlier by far. The most of
> times critical section lasted couple of us. The full histogram is here:
> 
> [min - 0.000006]: 2907 93.202950%
> (0.000006 - 0.000011]: 105 3.366464%
> (0.000011 - 0.000016]: 36 1.154216%
> (0.000016 - 0.000021]: 45 1.442770%
> (0.000021 - 0.000026]: 13 0.416800%
> (0.000026 - 0.000031]: 4 0.128246%
> (0.000031 - 0.000036]: 2 0.064123%
> (0.000036 - 0.000041]: 2 0.064123%
> (0.000041 - 0.000046]: 1 0.032062%
> (0.000046 - 0.000050]: 1 0.032062%
> (0.000050 - 0.000055]: 0 0.000000%
> (0.000055 - 0.000060]: 0 0.000000%
> (0.000060 - 0.000065]: 0 0.000000%
> (0.000065 - 0.000070]: 0 0.000000%
> (0.000070 - 0.000075]: 2 0.064123%
> (0.000075 - 0.000080]: 0 0.000000%
> (0.000080 - 0.000085]: 0 0.000000%
> (0.000085 - 0.000090]: 0 0.000000%
> (0.000090 - 0.000095]: 0 0.000000%
> (0.000095 - max]: 1 0.032062%
> 
> So although I still think that just failing the migration if we cannot
> invalidate buffer heads is a safer choice, just extending the private_lock
> protected section does not seem as bad as I was afraid.
> 

That does not seem too bad and your revised patch looks functionally
fine. I'd leave out the tracepoints though because a perf probe would have
got roughly the same data and the tracepoint may be too specific to track
another class of problem. Whether the tracepoint survives or not and
with a changelog added;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Andrew, which version do you want to go with, the original version or
this one that holds private_lock for slightly longer during migration?

> From fd584fb6fa6d48e1fae1077d2ab0021ae9c98edf Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 10 Jul 2019 11:31:01 +0200
> Subject: [PATCH] mm: migrate: Fix race with __find_get_block()
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/trace/events/migrate.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                   |  8 +++++++-
>  2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 705b33d1e395..15473a508216 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -70,6 +70,48 @@ TRACE_EVENT(mm_migrate_pages,
>  		__print_symbolic(__entry->mode, MIGRATE_MODE),
>  		__print_symbolic(__entry->reason, MIGRATE_REASON))
>  );
> +
> +TRACE_EVENT(mm_migrate_buffers_begin,
> +
> +	TP_PROTO(struct address_space *mapping, unsigned long index),
> +
> +	TP_ARGS(mapping, index),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long,	mapping)
> +		__field(unsigned long,	index)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mapping	= (unsigned long)mapping;
> +		__entry->index		= index;
> +	),
> +
> +	TP_printk("mapping=%lx index=%lu", __entry->mapping, __entry->index)
> +);
> +
> +TRACE_EVENT(mm_migrate_buffers_end,
> +
> +	TP_PROTO(struct address_space *mapping, unsigned long index, int rc),
> +
> +	TP_ARGS(mapping, index, rc),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long,	mapping)
> +		__field(unsigned long,	index)
> +		__field(int,		rc)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mapping	= (unsigned long)mapping;
> +		__entry->index		= index;
> +		__entry->rc		= rc;
> +	),
> +
> +	TP_printk("mapping=%lx index=%lu rc=%d", __entry->mapping, __entry->index, __entry->rc)
> +);
> +
> +
>  #endif /* _TRACE_MIGRATE_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e9594bc0d406..bce0f1b03a60 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -763,6 +763,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
>  recheck_buffers:
>  		busy = false;
>  		spin_lock(&mapping->private_lock);
> +		trace_mm_migrate_buffers_begin(mapping, page->index);
>  		bh = head;
>  		do {
>  			if (atomic_read(&bh->b_count)) {
> @@ -771,12 +772,13 @@ static int __buffer_migrate_page(struct address_space *mapping,
>  			}
>  			bh = bh->b_this_page;
>  		} while (bh != head);
> -		spin_unlock(&mapping->private_lock);
>  		if (busy) {
>  			if (invalidated) {
>  				rc = -EAGAIN;
>  				goto unlock_buffers;
>  			}
> +			trace_mm_migrate_buffers_end(mapping, page->index, -EAGAIN);
> +			spin_unlock(&mapping->private_lock);
>  			invalidate_bh_lrus();
>  			invalidated = true;
>  			goto recheck_buffers;
> @@ -809,6 +811,10 @@ static int __buffer_migrate_page(struct address_space *mapping,
>  
>  	rc = MIGRATEPAGE_SUCCESS;
>  unlock_buffers:
> +	if (check_refs) {
> +		trace_mm_migrate_buffers_end(mapping, page->index, rc);
> +		spin_unlock(&mapping->private_lock);
> +	}
>  	bh = head;
>  	do {
>  		unlock_buffer(bh);
> -- 
> 2.16.4
> 

-- 
Mel Gorman
SUSE Labs