linux-kernel.vger.kernel.org archive mirror
* [Patch] mm: Increase pagevec size on large system
@ 2020-06-26 21:23 Tim Chen
  2020-06-27  3:13 ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Tim Chen @ 2020-06-26 21:23 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko
  Cc: Tim Chen, Dave Hansen, Ying Huang, linux-mm, linux-kernel

Pages to be added to an LRU are first cached in a pagevec before being
added to the LRU in a batch via pagevec_lru_move_fn.  Adding the pages
in a batch this way mitigates contention on the LRU lock.  Currently
the pagevec size is defined to be 15.
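
For reference, the pagevec structure and the add path look roughly like
this (a simplified sketch of the ~v5.7 code; exact names and the per-CPU
plumbing vary between kernel versions):

	/* include/linux/pagevec.h, simplified */
	struct pagevec {
		unsigned char nr;			/* pages currently cached */
		bool percpu_pvec_drained;
		struct page *pages[PAGEVEC_SIZE];	/* 15 slots today */
	};

	/* mm/swap.c, simplified: a faulting CPU adds pages to its per-CPU
	 * pagevec; only when the pagevec fills up is the LRU lock taken,
	 * once, to move the whole batch onto the LRU. */
	void __lru_cache_add(struct page *page)
	{
		struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

		get_page(page);
		if (!pagevec_add(pvec, page) || PageCompound(page))
			__pagevec_lru_add(pvec);	/* batch under the LRU lock */
		put_cpu_var(lru_add_pvec);
	}

Each time the pagevec fills up, the LRU lock is taken once for the whole
batch, so the batch size directly determines how often the lock is
contended.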

During testing on a large SMT system with 2 sockets, 48 cores and 96
CPU threads, we found that a pagevec size of 15 is too small for
workloads that cause frequent page additions to the LRU.

With pmbench, 8.9% of the CPU cycles are spent contending
for the LRU lock.

   12.92%  pmbench          [kernel.kallsyms]         [k] queued_spin_lock_slowpath
            |
             --12.92%--0x5555555582f2
                       |
                        --12.92%--page_fault
                                  do_page_fault
                                  __do_page_fault
                                  handle_mm_fault
                                  __handle_mm_fault
                                  |
                                  |--8.90%--__lru_cache_add
                                  |          pagevec_lru_move_fn
                                  |          |
                                  |           --8.90%--_raw_spin_lock_irqsave
                                  |                     queued_spin_lock_slowpath

Enlarge the pagevec size to 31 to reduce LRU lock contention for
large systems.

The LRU lock contention is reduced from 8.9% of total CPU cycles
to 2.2% of CPU cycles.  And the pmbench throughput increases
from 88.8 Mpages/sec to 95.1 Mpages/sec.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/pagevec.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934eda64..466ebcdd190d 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -11,8 +11,16 @@
 
 #include <linux/xarray.h>
 
+#if CONFIG_NR_CPUS > 64
+/*
+ * Use a larger size to reduce LRU lock contention on large systems.
+ * 31 pointers + header align the pagevec structure to a power of two
+ */
+#define PAGEVEC_SIZE	31
+#else
 /* 15 pointers + header align the pagevec structure to a power of two */
 #define PAGEVEC_SIZE	15
+#endif
 
 struct page;
 struct address_space;
-- 
2.20.1



* Re: [Patch] mm: Increase pagevec size on large system
  2020-06-26 21:23 [Patch] mm: Increase pagevec size on large system Tim Chen
@ 2020-06-27  3:13 ` Matthew Wilcox
  2020-06-27  3:47   ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2020-06-27  3:13 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Dave Hansen, Ying Huang, linux-mm, linux-kernel

On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
> Enlarge the pagevec size to 31 to reduce LRU lock contention for
> large systems.
> 
> The LRU lock contention is reduced from 8.9% of total CPU cycles
> to 2.2% of CPU cycles.  And the pmbench throughput increases
> from 88.8 Mpages/sec to 95.1 Mpages/sec.

The downside here is that pagevecs are often stored on the stack (eg
truncate_inode_pages_range()) as well as being used for the LRU list.
On a 64-bit system, this increases the stack usage from 128 to 256 bytes
for this array.
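
To put numbers on that, a quick check with a stand-in layout (plain
userspace C with made-up names, not the actual kernel struct) mirroring
a 64-bit ABI comes out exactly there:

	#include <stdbool.h>
	#include <stdio.h>

	struct mock_pagevec_15 {	/* stand-in for struct pagevec today */
		unsigned char nr;
		bool percpu_pvec_drained;
		void *pages[15];
	};

	struct mock_pagevec_31 {	/* stand-in with 31 slots */
		unsigned char nr;
		bool percpu_pvec_drained;
		void *pages[31];
	};

	int main(void)
	{
		/* 2 header bytes pad to 8 for pointer alignment, then
		 * 15*8 = 120 or 31*8 = 248 bytes of pointers. */
		printf("%zu %zu\n", sizeof(struct mock_pagevec_15),
		       sizeof(struct mock_pagevec_31));	/* prints: 128 256 */
		return 0;
	}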

I wonder if we could do something where we transform the ones on the
stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
the ones used for the LRUs.  There's plenty of space in the header to
add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
length struct.
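
Roughly something like this (a hypothetical sketch, not actual kernel
code; the declaring macros would have to reserve the right amount of
storage behind pages[]):

	struct pagevec {
		unsigned char nr;
		unsigned char sz;	/* capacity of pages[], set at declaration */
		bool percpu_pvec_drained;
		struct page *pages[];	/* storage reserved by the declaring macro */
	};

	static inline unsigned pagevec_space(struct pagevec *pvec)
	{
		return pvec->sz - pvec->nr;	/* was: PAGEVEC_SIZE - pvec->nr */
	}

That way the on-stack users could keep 15 slots while the per-CPU LRU
pagevecs grow.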

Or maybe our stacks are now big enough that we just don't care.
What do you think?



* Re: [Patch] mm: Increase pagevec size on large system
  2020-06-27  3:13 ` Matthew Wilcox
@ 2020-06-27  3:47   ` Andrew Morton
  2020-06-29 16:57     ` Tim Chen
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2020-06-27  3:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tim Chen, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Dave Hansen, Ying Huang, linux-mm, linux-kernel

On Sat, 27 Jun 2020 04:13:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
> > Enlarge the pagevec size to 31 to reduce LRU lock contention for
> > large systems.
> > 
> > The LRU lock contention is reduced from 8.9% of total CPU cycles
> > to 2.2% of CPU cycles.  And the pmbench throughput increases
> > from 88.8 Mpages/sec to 95.1 Mpages/sec.
> 
> The downside here is that pagevecs are often stored on the stack (eg
> truncate_inode_pages_range()) as well as being used for the LRU list.
> On a 64-bit system, this increases the stack usage from 128 to 256 bytes
> for this array.
> 
> I wonder if we could do something where we transform the ones on the
> stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
> the ones used for the LRUs.  There's plenty of space in the header to
> add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
> length struct.
> 
> Or maybe our stacks are now big enough that we just don't care.
> What do you think?

And I wonder how useful CONFIG_NR_CPUS is for making this decision. 
Presumably a lot of general-purpose kernel builds have CONFIG_NR_CPUS
much larger than the actual number of CPUs.

I can't think of much of a fix for this, apart from making it larger on
all kernels.  Is there a downside to this?



* Re: [Patch] mm: Increase pagevec size on large system
  2020-06-27  3:47   ` Andrew Morton
@ 2020-06-29 16:57     ` Tim Chen
  2020-07-01  0:27       ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Tim Chen @ 2020-06-29 16:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox
  Cc: Vladimir Davydov, Johannes Weiner, Michal Hocko, Dave Hansen,
	Ying Huang, linux-mm, linux-kernel



On 6/26/20 8:47 PM, Andrew Morton wrote:
> On Sat, 27 Jun 2020 04:13:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> 
>> On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
>>> Enlarge the pagevec size to 31 to reduce LRU lock contention for
>>> large systems.
>>>
>>> The LRU lock contention is reduced from 8.9% of total CPU cycles
>>> to 2.2% of CPU cycles.  And the pmbench throughput increases
>>> from 88.8 Mpages/sec to 95.1 Mpages/sec.
>>
>> The downside here is that pagevecs are often stored on the stack (eg
>> truncate_inode_pages_range()) as well as being used for the LRU list.
>> On a 64-bit system, this increases the stack usage from 128 to 256 bytes
>> for this array.
>>
>> I wonder if we could do something where we transform the ones on the
>> stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
>> the ones used for the LRUs.  There's plenty of space in the header to
>> add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
>> length struct.
>>
>> Or maybe our stacks are now big enough that we just don't care.
>> What do you think?
> 
> And I wonder how useful CONFIG_NR_CPUS is for making this decision. 
> Presumably a lot of general-purpose kernel builds have CONFIG_NR_CPUS
> much larger than the actual number of CPUs.
> 
> I can't think of much of a fix for this, apart from making it larger on
> all kernels.  Is there a downside to this?
> 

Thanks, Matthew and Andrew, for the feedback.

I am okay with Matthew's suggestion of keeping the stack pagevec size unchanged.
Andrew, do you have a preference?

I was assuming that people who really care about saving kernel memory
would make CONFIG_NR_CPUS small.  I also have a hard time coming up
with a better scheme.

Otherwise, we would have to adjust the pagevec size once we actually
find out how many CPUs have been brought online.  Going that route
seems like a lot of added complexity.

Tim


* Re: [Patch] mm: Increase pagevec size on large system
  2020-06-29 16:57     ` Tim Chen
@ 2020-07-01  0:27       ` Andrew Morton
  2020-07-01 10:05         ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2020-07-01  0:27 UTC (permalink / raw)
  To: Tim Chen
  Cc: Matthew Wilcox, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Dave Hansen, Ying Huang, linux-mm, linux-kernel

On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> 
> 
> I am okay with Matthew's suggestion of keeping the stack pagevec size unchanged.
> Andrew, do you have a preference?
> 
> I was assuming that people who really care about saving kernel memory
> would make CONFIG_NR_CPUS small.  I also have a hard time coming up
> with a better scheme.
> 
> Otherwise, we would have to adjust the pagevec size once we actually
> find out how many CPUs have been brought online.  Going that route
> seems like a lot of added complexity.

Even if we were to do this, the worst-case stack usage on the largest
systems might be an issue.  If it isn't, then we might as well hard-wire
it to 31 elements anyway.

I dunno.  An extra 128 bytes of stack doesn't sound too bad, and the
performance benefit is significant.  Perhaps we just go with the
original patch.  If there are any on-stack pagevecs in the page reclaim
path then perhaps we could create a new mini-pagevec for just those, or
look at simply removing the pagevec optimization in there altogether.


* Re: [Patch] mm: Increase pagevec size on large system
  2020-07-01  0:27       ` Andrew Morton
@ 2020-07-01 10:05         ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2020-07-01 10:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Matthew Wilcox, Vladimir Davydov, Johannes Weiner,
	Dave Hansen, Ying Huang, linux-mm, linux-kernel

On Tue 30-06-20 17:27:13, Andrew Morton wrote:
> On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > I am okay with Matthew's suggestion of keeping the stack pagevec size unchanged.
> > Andrew, do you have a preference?
> > 
> > I was assuming that people who really care about saving kernel memory
> > would make CONFIG_NR_CPUS small.  I also have a hard time coming up
> > with a better scheme.
> > 
> > Otherwise, we would have to adjust the pagevec size once we actually
> > find out how many CPUs have been brought online.  Going that route
> > seems like a lot of added complexity.
> 
> Even if we were to do this, the worst-case stack usage on the largest
> systems might be an issue.  If it isn't, then we might as well hard-wire
> it to 31 elements anyway.

I am not sure this is really a matter of how large the machine is.  In
the writeout paths, for example, it depends much more on how complex the
IO stack is.

Direct memory reclaim is also a very stack-sensitive context.  As we are
not doing any writeout from direct reclaim anymore, I believe a large
part of the on-stack fs usage is not really relevant.  There seem to be
only a few on-stack users inside mm, and they shouldn't be part of the
memory reclaim path AFAICS.  I simply did
$ git grep "^[[:space:]]*struct pagevec[[:space:]][^*]"
and fortunately there weren't that many hits, so it is easy to get an
idea of the usage.  There is some usage in the graphics stack that should
be double checked, though.

Btw, I think that a pvec is likely a suboptimal data structure for many
on-stack users.  It allows only a very few slots to batch.  Something
like mmu_gather, which can optimistically grow the batch, sounds like
something that would be worth exploring.

The main question is whether the improvement is visible on any
non-artificial workloads.  If yes, then the quick fix is likely the best
way forward.  If this is mostly a microbenchmark thing, then I would be
happier to see a more long-term solution, e.g. scaling the pcp pagevec
sizes with the machine size, or even using something better than a pvec
(e.g. lru_deactivate_file could scale much further, and I am not sure
the pcp aspect is really improving anything - why don't we simply
invalidate all gathered pages at once at the end of
invalidate_mapping_pages?).
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2020-07-01 10:06 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
2020-06-26 21:23 [Patch] mm: Increase pagevec size on large system Tim Chen
2020-06-27  3:13 ` Matthew Wilcox
2020-06-27  3:47   ` Andrew Morton
2020-06-29 16:57     ` Tim Chen
2020-07-01  0:27       ` Andrew Morton
2020-07-01 10:05         ` Michal Hocko
