linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Yu Zhao <yuzhao@google.com>
Cc: Jens Axboe <axboe@kernel.dk>, SeongJae Park <sj38.park@gmail.com>,
	Linux-MM <linux-mm@kvack.org>, Andi Kleen <ak@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Benjamin Manes <ben.manes@gmail.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Miaohe Lin <linmiaohe@huawei.com>,
	Michael Larabel <michael@michaellarabel.com>,
	Michal Hocko <mhocko@suse.com>,
	Michel Lespinasse <michel@lespinasse.org>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Rong Chen <rong.a.chen@intel.com>,
	SeongJae Park <sjpark@amazon.de>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
	Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org,
	Kernel Page Reclaim v2 <page-reclaim@google.com>
Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework
Date: Thu, 15 Apr 2021 11:36:17 +1000	[thread overview]
Message-ID: <20210415013617.GU1990290@dread.disaster.area> (raw)
In-Reply-To: <CAOUHufa5id9mmjud-UQd4agLCtmDypdNDStkxgoQxsUoh8Qcsg@mail.gmail.com>

On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > remove_mapping() to pull the page form the LRU, we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
> 
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.

You're still looking at micro-scale behaviour, not the larger-scale
batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to
shrinker_page_list() at a different rate?

When I say "batch of work" when talking about the page cache cycling
*500 thousand pages a second* through the cache, I'm not talking
about batches of 32 pages. I'm talking about the entire batch of
work kswapd does in an invocation cycle.

Is it scanning 100k pages 10 times a second? or 10k pages a hundred
times a second? How long does a batch take to run? how long does is
sleep between processing batches? Is there any change in these
metrics as a result of the multi-gen LRU patches?

Basically, we're looking at how access to the mapping lock is
changing the contention profile, and whether that is signficant or
not. I suspect it is, because when you have highly contended locks
and you do something external that reduces unrelated lock
contention, it's because that external thing is taking more time to
do and so there's less time to spend hitting locks hard...

As such, I don't think this test is a good measure of the multi-gen
LRU patches at all - performance is dominated by the severity of
lock contention external to the LRU scanning algorithm, and it's
hard to infer anything through suck lock contention....

> I don't want to paste everything here -- they'd clutter. Please see
> all the detailed profiles in the attachment. Let me know if their
> formats are no to your liking. I still have the raw perf.data.

Which makes the discussion thread just about impossible to follow or
comment on. Please just post the relevant excerpt of the stack
profile that you are commenting on.

> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> > > lot better at finding pages to reclaim and therefore doesn't overkill
> > > slabs.
> >
> > That's actually very likely to be a Bad Thing and cause unexpected
> > perofrmance and OOM based regressions. When the machine finally runs
> > out of page cache it can easily reclaim, it's going to get stuck
> > with long tail latencies reclaiming huge slab caches as they've had
> > no substantial ongoing pressure put on them to keep them in balance
> > with the overall memory pressure the system is under...
> 
> Well. It does use the existing equation. That is if it scans X% of
> pages, then it scans X% of slab objects. But 1) it often finds pages
> to reclaim at a lower X% 2) the pages it reclaims are less likely to
> refault. So the side effect is the overall slab objects it scans also
> reduce. I do see your point but don't see any options, at the moment.

You'll have to rebalance the memory reclaim algorithms to either:

a) make the shrinkers more aggressive so they do more reclaim when
called less often, or

b) lower the threshold at which shrinkers are called.

Keeping the slab caches in balance with page cache memory pressure
is fairly important for the performance of workloads that generate
inode and dentry cache load, especially those that don't actually
generate page cache pressure. This is the hardest part about making
fundamental changes to memory reclaim behaviour: ensuring that the
system remains balanced over a wide range of differing workloads and
reacts sanely to sudden step changes in workload behaviour...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2021-04-15  1:36 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-13  6:56 Yu Zhao
2021-04-13  6:56 ` [PATCH v2 01/16] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
2021-04-13  6:56 ` [PATCH v2 02/16] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
2021-04-13  6:56 ` [PATCH v2 03/16] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
2021-04-13  6:56 ` [PATCH v2 04/16] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
2021-04-13  6:56 ` [PATCH v2 05/16] mm/swap.c: export activate_page() Yu Zhao
2021-04-13  6:56 ` [PATCH v2 06/16] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
2021-04-13  6:56 ` [PATCH v2 07/16] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-04-13  6:56 ` [PATCH v2 08/16] mm: multigenerational lru: groundwork Yu Zhao
2021-04-13  6:56 ` [PATCH v2 09/16] mm: multigenerational lru: activation Yu Zhao
2021-04-13  6:56 ` [PATCH v2 10/16] mm: multigenerational lru: mm_struct list Yu Zhao
2021-04-14 14:36   ` Matthew Wilcox
2021-04-13  6:56 ` [PATCH v2 11/16] mm: multigenerational lru: aging Yu Zhao
2021-04-13  6:56 ` [PATCH v2 12/16] mm: multigenerational lru: eviction Yu Zhao
2021-04-13  6:56 ` [PATCH v2 13/16] mm: multigenerational lru: page reclaim Yu Zhao
2021-04-13  6:56 ` [PATCH v2 14/16] mm: multigenerational lru: user interface Yu Zhao
2021-04-13  6:56 ` [PATCH v2 15/16] mm: multigenerational lru: Kconfig Yu Zhao
2021-04-13  6:56 ` [PATCH v2 16/16] mm: multigenerational lru: documentation Yu Zhao
2021-04-13  7:51 ` [PATCH v2 00/16] Multigenerational LRU Framework SeongJae Park
2021-04-13 16:13   ` Jens Axboe
2021-04-13 16:42     ` SeongJae Park
2021-04-13 23:14     ` Dave Chinner
2021-04-14  2:29       ` Rik van Riel
     [not found]         ` <CAOUHufafMcaG8sOS=1YMy2P_6p0R1FzP16bCwpUau7g1-PybBQ@mail.gmail.com>
2021-04-14  6:15           ` Huang, Ying
2021-04-14  7:58             ` Yu Zhao
2021-04-14  8:27               ` Huang, Ying
2021-04-14 13:51                 ` Rik van Riel
2021-04-14 15:56                   ` Andi Kleen
2021-04-14 15:58                   ` [page-reclaim] " Shakeel Butt
2021-04-14 18:45                   ` Yu Zhao
2021-04-14 15:51           ` Andi Kleen
2021-04-14 15:58             ` Rik van Riel
2021-04-14 19:14               ` Yu Zhao
2021-04-14 19:41                 ` Rik van Riel
2021-04-14 20:08                   ` Yu Zhao
2021-04-14 19:04             ` Yu Zhao
2021-04-15  3:00               ` Andi Kleen
2021-04-15  7:13                 ` Yu Zhao
2021-04-15  8:19                   ` Huang, Ying
2021-04-15  9:57                   ` Michel Lespinasse
2021-04-24  2:33                     ` Yu Zhao
2021-04-24  3:30                       ` Andi Kleen
2021-04-24  4:16                         ` Yu Zhao
2021-04-14  3:40       ` Yu Zhao
2021-04-14  4:50         ` Dave Chinner
2021-04-14  7:16           ` Yu Zhao
2021-04-14 10:00             ` Yu Zhao
2021-04-15  1:36             ` Dave Chinner [this message]
2021-04-24 21:21               ` Yu Zhao
2021-04-14 14:43       ` Jens Axboe
2021-04-14 19:42         ` Yu Zhao
2021-04-15  1:21         ` Dave Chinner
2021-04-14 17:43 ` Johannes Weiner
2021-04-27 10:35   ` Yu Zhao
2021-04-29 23:46 ` Konstantin Kharlamov
2021-04-30  6:37   ` Konstantin Kharlamov
2021-04-30 19:31     ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210415013617.GU1990290@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=ben.manes@gmail.com \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@lists.01.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=michael@michaellarabel.com \
    --cc=michel@lespinasse.org \
    --cc=page-reclaim@google.com \
    --cc=riel@surriel.com \
    --cc=rong.a.chen@intel.com \
    --cc=shy828301@gmail.com \
    --cc=sj38.park@gmail.com \
    --cc=sjpark@amazon.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    --cc=ziy@nvidia.com \
    --subject='Re: [PATCH v2 00/16] Multigenerational LRU Framework' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).