Re: [patch] vmscan: fix zone shrinking exit when scan work is done

From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Michal Hocko <mhocko@suse.cz>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done
Date: Thu, 10 Feb 2011 13:33:24 +0000	[thread overview]
Message-ID: <20110210133323.GH17873@csn.ul.ie> (raw)
In-Reply-To: <20110210124838.GU3347@random.random>

On Thu, Feb 10, 2011 at 01:48:38PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> > We should not be ending up in a situation with the LRU list of only
> > page_evictable pages and that situation persisting causing excessive (or
> > infinite) looping. As unevictable pages are encountered on the LRU list,
> > they should be moved to the unevictable lists by putback_lru_page().  Are you
> > aware of a situation where this becomes broken?
> > 
> > I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> > are all get moved. In this case, nr_scanned is positive and we continue
> > to scan but this is expected and desirable: Reclaim/compaction needs more
> > pages to be freed before it starts compaction. If it stops scanning early,
> > then it would just fail the allocation later. This is what the "NOTE" is about.
> > 
> > I prefer Johannes' fix for the observed problem.
> 
> should_continue_reclaim is only needed for compaction. It tries to
> free enough pages so that compaction can succeed in its defrag
> attempt.

Correct.

> So breaking the loop faster isn't going to cause failures for
> 0 order pages.

Also true, I commented on this in the "Note" your patch deletes and a
suggestion on how an alternative would be to break early unless GFP_REPEAT.

> My worry is that we loop too much in shrink_zone just
> for compaction even when we don't do any progress. shrink_zone would
> never scan more than SWAP_CLUSTER_MAX pages, before this change.

Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
scanning was still pretty high. The other costs of lumpy reclaim would hide
it of course.

> Now
> it can loop over the whole lru as long as we're scanning stuff.

True, the alternative being failing the allocation. Returning sooner is of
course an option, but it would be preferable to see a case where the logic
after Johannes' patch is failing.

> Ok to
> overboost shrink_zone if we're making progress to allow compaction at
> the next round, but if we don't visibly make progress, I'm concerned
> that it may be too aggressive to scan the whole list. The performance
> benefit of having an hugepage isn't as huge as scanning all pages in
> the lru when before we would have broken the loop and declared failure
> after only SWAP_CLUSTER_MAX pages, and then we would have fallen back
> in a order 0 allocation.

What about other cases such as order-1 allocations for stack or order-3
allocations for those network cards using jumbo frames without
scatter/gather?

Don't get me wrong, I see your point but I'm wondering if there really are
cases where we routinely scan an entire LRU list of unevictable pages that
are somehow not being migrated properly to the unevictable lists. If
this is happening, we are also in trouble for reclaiming for order-0
pages, right?

> The fix may help of course, maybe it's enough
> for his case I don't know, but I don't see it making a whole lot of
> difference, except now it will stop when the lru is practically empty
> which clearly is an improvement. I think we shouldn't be so worried
> about succeeding compaction, the important thing is we don't waste
> time in compaction if there's not enough free memory but
> compaction_suitable used by both logics should be enough for that.
> 
> I'd rather prefer that if hugetlbfs has special needs it uses a __GFP_

It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
as an alternative to how we could break early while still being agressive
when required. The only reason it's not that way now is because a) I didn't
consider an LRU mostly full of unevictable pages to be the normal case and b)
for allocations such as order-3 that are preferable not to fail.

> flag or similar that increases how compaction is strict in succeeding,
> up to scanning the whole lru in one go in order to make some free
> memory for compaction to succeed.
> 
> Going ahead with the scan until compaction_suitable is true instead
> makes sense when there's absence of memory pressure and nr_reclaimed
> is never zero.
> 
> Maybe we should try a bit more times than just nr_reclaim but going
> over the whole lru, sounds a bit extreme.
> 

Where should be draw the line? We could come up with ratio of the lists
depending on priority but it'd be hard to measure the gain or loss
without having a profile of a problem case to look at.

> The issue isn't just for unevictable pages, that will be refiled
> during the scan but it will also happen in presence of lots of
> referenced pages. For example if we don't apply my fix, the current
> code can take down all young bits in all ptes in one go in the whole
> system before returning from shrink_zone, that is too much in my view,
> and losing all that information in one go (not even to tell the cost
> associated with losing it) can hardly be offseted by the improvement
> given by 1 more hugepage.
> 
> But please let me know if I've misread something...
> 

I don't think you have misread anything but if we're going to weaken
this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
preserve being aggressive if set. I'm also not convinced we routinely get
into a situation where the LRU consists of almost all unevictable pages
and if we are in this situation, that is a serious problem on its own. It
would also be preferable if we could get latency figures on alloc_pages for
hugepage-sized allocations and a count of how many are succeeding or failing
to measure the impact (if any).

-- 
Mel Gorman
SUSE Labs