From: Aaron Lu <aaron.lu@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Huang Ying <ying.huang@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Kemi Wang <kemi.wang@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Andi Kleen <ak@linux.intel.com>, Michal Hocko <mhocko@suse.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Matthew Wilcox <willy@infradead.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>
Subject: Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
Date: Wed, 21 Mar 2018 23:01:40 +0800
Message-ID: <20180321150140.GA1838@intel.com>
In-Reply-To: <12a89171-27b8-af4f-450e-41e5775683c5@suse.cz>

Hi Vlastimil,

Thanks for taking the time to review the patch, I appreciate it.

On Wed, Mar 21, 2018 at 01:55:01PM +0100, Vlastimil Babka wrote:
> On 03/20/2018 09:54 AM, Aaron Lu wrote:
> > A profile on an Intel Skylake server shows that the most time-consuming
> > part under zone->lock on the allocation path is accessing the
> > to-be-returned pages' "struct page" on the free_list inside zone->lock.
> > One explanation is that different CPUs release pages to the head of the
> > free_list, so those pages' "struct page" may very well be cache cold for
> > the allocating CPU when it grabs these pages from the free_list's head.
> > The purpose here is to avoid touching these pages one by one inside
> > zone->lock.
> > 
> > One idea is to take the requested number of pages off the free_list
> > with something like list_cut_position() and then adjust nr_free of the
> > free_area accordingly inside zone->lock, while other operations, like
> > clearing the PageBuddy flag of these pages, are done outside of
> > zone->lock.
> > 
> > list_cut_position() needs to know where to cut; that is what the new
> > 'struct cluster' is meant to provide. All pages on order 0's free_list
> > belong to a cluster, so when a number of pages is needed, the cluster
> > to which the head page of the free_list belongs is checked and then the
> > tail page of that cluster can be found. With the tail page,
> > list_cut_position() can be used to drop the whole cluster off the
> > free_list. The 'struct cluster' also has 'nr' to tell how many pages
> > the cluster has, so nr_free of the free_area can be adjusted inside the
> > lock too.
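
To make that a bit more concrete, the allocation side would roughly do the
following (a simplified, untested sketch; 'free_list'/'area' stand for the
order-0 free_area's list, and the 'struct cluster' fields and the
page_to_cluster() helper are illustrative names, not necessarily what the
patch uses):

	struct page *page, *tail;
	struct cluster *c;
	LIST_HEAD(taken);

	spin_lock(&zone->lock);
	page = list_first_entry(free_list, struct page, lru);
	c = page_to_cluster(page);	/* cluster the head page belongs to */
	tail = c->tail;			/* last page of that cluster */
	list_cut_position(&taken, free_list, &tail->lru);
	area->nr_free -= c->nr;		/* nr_free adjusted inside the lock */
	spin_unlock(&zone->lock);

	/* PageBuddy of each page is cleared outside zone->lock */
	list_for_each_entry(page, &taken, lru)
		__ClearPageBuddy(page);
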
> > 
> > This creates a race window though: from the moment zone->lock is
> > dropped until these pages' PageBuddy flags get cleared, the pages are
> > no longer in buddy but still have the PageBuddy flag set.
> > 
> > This doesn't cause problems for users that access buddy pages through
> > the free_list. But other users, like move_freepages(), which is used to
> > move a pageblock's pages from one migratetype to another in the
> > fallback allocation path, test the PageBuddy flag of pages derived from
> > their PFNs. The end result could be that pages in the race window get
> > moved back to the free_list of another migratetype. For this reason, a
> > synchronization function, zone_wait_cluster_alloc(), is introduced to
> > wait until all pages are in the correct state. This function is meant
> > to be called with zone->lock held, so after it returns we do not need
> > to worry about new pages entering the racy state.
> > 
> > Another user is compaction, which scans a pageblock for migratable
> > candidates. In this process, pages derived from PFNs are checked for
> > the PageBuddy flag to decide whether they are merge-skipped pages. To
> > avoid a racy page getting merged back into buddy, the
> > zone_wait_and_disable_cluster_alloc() function is introduced to:
> > 1 disable clustered allocation by increasing zone->cluster.disable_depth;
> > 2 wait until the race window has passed by calling zone_wait_cluster_alloc().
> > This function is also meant to be called with zone->lock held, so after
> > it returns all pages are in the correct state and no more clustered
> > allocation will be attempted until zone_enable_cluster_alloc() is
> > called to decrease zone->cluster.disable_depth.
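
In case it helps to see the shape of it, the synchronization helpers
described above would look roughly like this (a sketch only, with
illustrative field names such as 'in_progress'; the actual patch differs
in details):

static void zone_wait_cluster_alloc(struct zone *zone)
{
	/*
	 * Called with zone->lock held: no new clustered allocation can
	 * start, so just wait for in-flight ones to finish clearing
	 * PageBuddy on the pages they took off the free_list.
	 */
	while (atomic_read(&zone->cluster.in_progress))
		cpu_relax();
}

static void zone_wait_and_disable_cluster_alloc(struct zone *zone)
{
	/* 1: stop new clustered allocations ... */
	zone->cluster.disable_depth++;
	/* 2: ... then wait for the race window to close */
	zone_wait_cluster_alloc(zone);
}

static void zone_enable_cluster_alloc(struct zone *zone)
{
	zone->cluster.disable_depth--;
}
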
> 
> I'm sorry, but I feel the added complexity here is simply too large to
> justify the change. Especially if the motivation seems to be just the
> microbenchmark. It would be better if this was motivated by a real
> workload where zone lock contention was identified as the main issue,
> and we would see the improvements on the workload. We could also e.g.
> find out that the problem can be avoided at a different level.

One thing I'm aware of is that there is an app that consumes a ton of
memory, and when it misbehaves or crashes it takes some 10-20 minutes to
exit (munmap() takes a long time to free all that consumed memory).

THP could help a lot, but it's beyond my understanding why they didn't
use it.

> 
> Besides complexity, it may also add overhead to the non-contended case,
> i.e. the atomic operations on in_progress. This goes against recent page
> allocation optimizations by Mel Gorman etc.
> 
> Would perhaps prefetching the next page in freelist (in
> remove_from_buddy()) help here?

I'm afraid there isn't a big enough window for prefetch() to actually make
a difference, but I could give it a try.
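
Something like the below is what I have in mind (rough and untested; it
assumes we are still under zone->lock in the remove_from_buddy()/rmqueue
path, so 'area' and 'migratetype' come from the surrounding context):

	/*
	 * Prefetch the "struct page" of the next page on the free_list so
	 * it is hopefully cache hot by the time we come back for it.
	 */
	if (!list_empty(&area->free_list[migratetype])) {
		struct page *next;

		next = list_first_entry(&area->free_list[migratetype],
					struct page, lru);
		prefetch(next);
	}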

We also considered prefetching the free_list before taking the lock, but
that prefetch could be useless (i.e. the prefetched page can be taken by
another CPU and be gone by the time we actually acquire the lock), and
iterating the list outside the lock can be dangerous.

> > The two patches could eliminate zone->lock contention entirely but, at
> > the same time, pgdat->lru_lock contention rose to 82%. Final
> > performance increased by about 8.3%.
> > 
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  Documentation/vm/struct_page_field |   5 +
> >  include/linux/mm_types.h           |   2 +
> >  include/linux/mmzone.h             |  35 +++++
> >  mm/compaction.c                    |   4 +
> >  mm/internal.h                      |  34 +++++
> >  mm/page_alloc.c                    | 288 +++++++++++++++++++++++++++++++++++--
> >  6 files changed, 360 insertions(+), 8 deletions(-)
