Re: [Patch] mm: Increase pagevec size on large system

From: Michal Hocko <mhocko@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vladimir Davydov <vdavydov@virtuozzo.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Ying Huang <ying.huang@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [Patch] mm: Increase pagevec size on large system
Date: Wed, 1 Jul 2020 12:05:57 +0200	[thread overview]
Message-ID: <20200701100557.GQ2369@dhcp22.suse.cz> (raw)
In-Reply-To: <20200630172713.496590a923744c0e0160d36b@linux-foundation.org>

On Tue 30-06-20 17:27:13, Andrew Morton wrote:
> On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > I am okay with Matthew's suggestion of keeping the stack pagevec size unchanged.
> > Andrew, do you have a preference?
> > 
> > I was assuming that for people who really care about saving the kernel memory
> > usage, they would make CONFIG_NR_CPUS small. I also have a hard time coming
> > up with a better scheme.
> > 
> > Otherwise, we will have to adjust the pagevec size when we actually 
> > found out how many CPUs we have brought online.  It seems like a lot
> > of added complexity for going that route.
> 
> Even if we were to do this, the worst-case stack usage on the largest
> systems might be an issue.  If it isn't then we might as well hard-wire
> it to 31 elements anyway,

I am not sure this is really a matter of how large the machine is. For
example in the writeout paths this really depends on how complex the IO
stack is much more.

Direct memory reclaim is also a very sensitive stack context. As we are
not doing any writeout anymore I believe a large part of the on stack fs
usage is not really relevant. There seem to be only few on stack users
inside mm and they shouldn't be part of the memory reclaim AFAICS.
I have simply did
$ git grep "^[[:space:]]*struct pagevec[[:space:]][^*]"
and fortunately there weren't that many hits to get an idea about the
usage. There is some usage in the graphic stack that should be double
check though.

Btw. I think that pvec is likely a suboptimal data structure for many
on stack users. It allows only very few slots to batch. Something like
mmu_gather which can optimistically increase the batch sounds like
something that would be worth

The main question is whether the improvement is  visible on any
non-artificial workloads. If yes then the quick fix
is likely the best way forward. If this is mostly a microbench thingy
then I would be happier to see a more longterm solution. E.g. scale
pcp pagevec sizes on the machine size or even use something better than
pvec (e.g. lru_deactivate_file could scale much more and I am not sure
pcp aspect is really improving anything - why don't we simply invalidate
all gathered pages at once at the end of invalidate_mapping_pages?).
-- 
Michal Hocko
SUSE Labs