* memcgroup lruvec_lru_size scaling issue
@ 2019-10-14 17:17 Tim Chen
  2019-10-14 17:37 ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Tim Chen @ 2019-10-14 17:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton, Dave Hansen

We were running a database benchmark in a mem cgroup and found that lruvec_lru_size
is taking up a huge chunk of CPU cycles (about 25% of our kernel time) on the 5.3 kernel.

The main issue is the loop in lruvec_page_state_local called by lruvec_lru_size
in the mem cgroup path:

for_each_possible_cpu(cpu)
	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

It is costly looping through all the cpus to get the lru vec size info.
And doing this on our workload with 96 cpu threads and 500 mem cgroups
makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
which is a lot of data structures to be running through all the time.
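
For reference, the 5.3 code path looks roughly like the sketch below (paraphrased
from mm/vmscan.c and include/linux/memcontrol.h and trimmed for brevity, so treat
it as an approximation rather than the exact source):

unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
	unsigned long lru_size;

	if (!mem_cgroup_disabled())
		/* memcg case: sums a per-cpu counter over every possible CPU */
		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
	else
		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
	...
}

static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
						    enum node_stat_item idx)
{
	struct mem_cgroup_per_node *pn;
	long x = 0;
	int cpu;

	if (mem_cgroup_disabled())
		return node_page_state(lruvec_pgdat(lruvec), idx);

	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	/* the expensive part: one potentially cache-cold read per possible CPU */
	for_each_possible_cpu(cpu)
		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
	return x < 0 ? 0 : x;
}

get_scan_count() and inactive_list_is_low() call lruvec_lru_size() several times
per shrink_node_memcg() pass, so this walk repeats constantly under direct reclaim.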

Honglei's patch
(https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@oracle.com/)
restores the previous method for computing lru_size and is much more efficient at getting the lru_size.
We got a 20% throughput improvement in our database benchmark with Honglei's patch, and
lruvec_lru_size's cpu overhead completely disappeared from the cpu profile.

We'd like to see Honglei's patch merged.

The problem can also be reproduced by running simple multi-threaded pmbench benchmark
with a fast Optane SSD swap (see profile below).


6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
            |
            |--3.07%--lruvec_lru_size
            |          |
            |          |--2.11%--cpumask_next
            |          |          |
            |          |           --1.66%--find_next_bit
            |          |
            |           --0.57%--call_function_interrupt
            |                     |
            |                      --0.55%--smp_call_function_interrupt
            |
            |--1.59%--0x441f0fc3d009
            |          _ops_rdtsc_init_base_freq
            |          access_histogram
            |          page_fault
            |          __do_page_fault
            |          handle_mm_fault
            |          __handle_mm_fault
            |          |
            |           --1.54%--do_swap_page
            |                     swapin_readahead
            |                     swap_cluster_readahead
            |                     |
            |                      --1.53%--read_swap_cache_async
            |                                __read_swap_cache_async
            |                                alloc_pages_vma
            |                                __alloc_pages_nodemask
            |                                __alloc_pages_slowpath
            |                                try_to_free_pages
            |                                do_try_to_free_pages
            |                                shrink_node
            |                                shrink_node_memcg
            |                                |
            |                                |--0.77%--lruvec_lru_size
            |                                |
            |                                 --0.76%--inactive_list_is_low
            |                                           |
            |                                            --0.76%--lruvec_lru_size
            |
             --1.50%--measure_read
                       page_fault
                       __do_page_fault
                       handle_mm_fault
                       __handle_mm_fault
                       do_swap_page
                       swapin_readahead
                       swap_cluster_readahead
                       |
                        --1.48%--read_swap_cache_async
                                  __read_swap_cache_async
                                  alloc_pages_vma
                                  __alloc_pages_nodemask
                                  __alloc_pages_slowpath
                                  try_to_free_pages
                                  do_try_to_free_pages
                                  shrink_node
                                  shrink_node_memcg
                                  |
                                  |--0.75%--inactive_list_is_low
                                  |          |
                                  |           --0.75%--lruvec_lru_size
                                  |
                                   --0.73%--lruvec_lru_size


Thanks.

Tim



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 17:17 memcgroup lruvec_lru_size scaling issue Tim Chen
@ 2019-10-14 17:37 ` Michal Hocko
  2019-10-14 17:49   ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-10-14 17:37 UTC (permalink / raw)
  To: Tim Chen
  Cc: Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton, Dave Hansen

On Mon 14-10-19 10:17:45, Tim Chen wrote:
> We were running a database benchmark in a mem cgroup and found that lruvec_lru_size
> is taking up a huge chunk of CPU cycles (about 25% of our kernel time) on the 5.3 kernel.
> 
> The main issue is the loop in lruvec_page_state_local called by lruvec_lru_size
> in the mem cgroup path:
> 
> for_each_possible_cpu(cpu)
> 	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> 
> It is costly looping through all the cpus to get the lru vec size info.
> And doing this on our workload with 96 cpu threads and 500 mem cgroups
> makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
> which is a lot of data structures to be running through all the time.

Why does the number of cgroup matter?

> Honglei's patch
> (https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@oracle.com/)
> restores the previous method for computing lru_size and is much more efficient at getting the lru_size.
> We got a 20% throughput improvement in our database benchmark with Honglei's patch, and
> lruvec_lru_size's cpu overhead completely disappeared from the cpu profile.
> 
> We'd like to see Honglei's patch merged.

The main problem with the patch was a lack of justification. If the
performance improvement is this large (I am quite surprised TBH) then I
would obviously not have any objections. Care to send a patch with the
complete changelog?

> The problem can also be reproduced by running simple multi-threaded pmbench benchmark
> with a fast Optane SSD swap (see profile below).
> 
> 
> 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
>             |
>             |--3.07%--lruvec_lru_size
>             |          |
>             |          |--2.11%--cpumask_next
>             |          |          |
>             |          |           --1.66%--find_next_bit
>             |          |
>             |           --0.57%--call_function_interrupt
>             |                     |
>             |                      --0.55%--smp_call_function_interrupt
>             |
>             |--1.59%--0x441f0fc3d009
>             |          _ops_rdtsc_init_base_freq
>             |          access_histogram
>             |          page_fault
>             |          __do_page_fault
>             |          handle_mm_fault
>             |          __handle_mm_fault
>             |          |
>             |           --1.54%--do_swap_page
>             |                     swapin_readahead
>             |                     swap_cluster_readahead
>             |                     |
>             |                      --1.53%--read_swap_cache_async
>             |                                __read_swap_cache_async
>             |                                alloc_pages_vma
>             |                                __alloc_pages_nodemask
>             |                                __alloc_pages_slowpath
>             |                                try_to_free_pages
>             |                                do_try_to_free_pages
>             |                                shrink_node
>             |                                shrink_node_memcg
>             |                                |
>             |                                |--0.77%--lruvec_lru_size
>             |                                |
>             |                                 --0.76%--inactive_list_is_low
>             |                                           |
>             |                                            --0.76%--lruvec_lru_size
>             |
>              --1.50%--measure_read
>                        page_fault
>                        __do_page_fault
>                        handle_mm_fault
>                        __handle_mm_fault
>                        do_swap_page
>                        swapin_readahead
>                        swap_cluster_readahead
>                        |
>                         --1.48%--read_swap_cache_async
>                                   __read_swap_cache_async
>                                   alloc_pages_vma
>                                   __alloc_pages_nodemask
>                                   __alloc_pages_slowpath
>                                   try_to_free_pages
>                                   do_try_to_free_pages
>                                   shrink_node
>                                   shrink_node_memcg
>                                   |
>                                   |--0.75%--inactive_list_is_low
>                                   |          |
>                                   |           --0.75%--lruvec_lru_size
>                                   |
>                                    --0.73%--lruvec_lru_size
> 
> 
> Thanks.
> 
> Tim

-- 
Michal Hocko
SUSE Labs



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 17:37 ` Michal Hocko
@ 2019-10-14 17:49   ` Dave Hansen
  2019-10-14 17:59     ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2019-10-14 17:49 UTC (permalink / raw)
  To: Michal Hocko, Tim Chen
  Cc: Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton

On 10/14/19 10:37 AM, Michal Hocko wrote:
>> for_each_possible_cpu(cpu)
>> 	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
>>
>> It is costly looping through all the cpus to get the lru vec size info.
>> And doing this on our workload with 96 cpu threads and 500 mem cgroups
>> makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
>> which is a lot of data structures to be running through all the time.
> Why does the number of cgroup matter?

I was thinking purely of the cache footprint.  If it's reading
pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
be 18k of data for the whole system and the caching would be pretty
efficient and all 18k would probably survive a tight page fault loop in
the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
fit in the L1 and probably wouldn't survive a tight page fault loop if
both logical threads were banging on different cgroups.
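
Spelling the arithmetic out (assuming 64-byte cachelines and a 32k L1d per
core -- both numbers are assumptions on my part):

  3 cachelines * 64 bytes  = 192 bytes touched per (cgroup, CPU) pair
  192 bytes * 96 CPUs      = ~18k summed for one cgroup's lruvec_lru_size() call
  192 bytes * 500 cgroups  = ~90k of counter data per CPU thread, vs. a 32k L1d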

It's just a theory, but it's why I noted the number of cgroups when I
initially saw this show up in profiles.



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 17:49   ` Dave Hansen
@ 2019-10-14 17:59     ` Michal Hocko
  2019-10-14 18:06       ` Tim Chen
  2019-10-14 18:11       ` Dave Hansen
  0 siblings, 2 replies; 12+ messages in thread
From: Michal Hocko @ 2019-10-14 17:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Tim Chen, Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton

On Mon 14-10-19 10:49:49, Dave Hansen wrote:
> On 10/14/19 10:37 AM, Michal Hocko wrote:
> >> for_each_possible_cpu(cpu)
> >> 	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> >>
> >> It is costly looping through all the cpus to get the lru vec size info.
> >> And doing this on our workload with 96 cpu threads and 500 mem cgroups
> >> makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
> >> which is a lot of data structures to be running through all the time.
> > Why does the number of cgroup matter?
> 
> I was thinking purely of the cache footprint.  If it's reading
> pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
> bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
> be 18k of data for the whole system and the caching would be pretty
> efficient and all 18k would probably survive a tight page fault loop in
> the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
> fit in the L1 and probably wouldn't survive a tight page fault loop if
> both logical threads were banging on different cgroups.
> 
> It's just a theory, but it's why I noted the number of cgroups when I
> initially saw this show up in profiles.

Yes, the cache traffic might be really high but I still find it a bit
surprising that it makes such a large footprint because this should be
mostly called from slow paths (reclaim) and the real work done should
just be larger - at least that's my intuition which might be quite off
here. How much is that 25% of the system time in the total time btw?

-- 
Michal Hocko
SUSE Labs



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 17:59     ` Michal Hocko
@ 2019-10-14 18:06       ` Tim Chen
  2019-10-14 18:31         ` Michal Hocko
  2019-10-14 18:11       ` Dave Hansen
  1 sibling, 1 reply; 12+ messages in thread
From: Tim Chen @ 2019-10-14 18:06 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen
  Cc: Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton

On 10/14/19 10:59 AM, Michal Hocko wrote:
> On Mon 14-10-19 10:49:49, Dave Hansen wrote:
>> On 10/14/19 10:37 AM, Michal Hocko wrote:
>>>> for_each_possible_cpu(cpu)
>>>> 	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
>>>>
>>>> It is costly looping through all the cpus to get the lru vec size info.
>>>> And doing this on our workload with 96 cpu threads and 500 mem cgroups
>>>> makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
>>>> which is a lot of data structures to be running through all the time.
>>> Why does the number of cgroup matter?
>>
>> I was thinking purely of the cache footprint.  If it's reading
>> pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
>> bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
>> be 18k of data for the whole system and the caching would be pretty
>> efficient and all 18k would probably survive a tight page fault loop in
>> the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
>> fit in the L1 and probably wouldn't survive a tight page fault loop if
>> both logical threads were banging on different cgroups.
>>
>> It's just a theory, but it's why I noted the number of cgroups when I
>> initially saw this show up in profiles.
> 
> Yes, the cache traffic might be really high but I still find it a bit
> surprising that it makes such a large footprint because this should be
> mostly called from slow paths (reclaim) and the real work done should
> just be larger - at least that's my intuition which might be quite off
> here. How much is that 25% of the system time in the total time btw?
> 

About 7% of total cpu cycles.

Tim



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 17:59     ` Michal Hocko
  2019-10-14 18:06       ` Tim Chen
@ 2019-10-14 18:11       ` Dave Hansen
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2019-10-14 18:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton

On 10/14/19 10:59 AM, Michal Hocko wrote:
>> It's just a theory, but it's why I noted the number of cgroups when I
>> initially saw this show up in profiles.
> Yes, the cache traffic might be really high but I still find it a bit
> surprising that it makes such a large footprint because this should be
> mostly called from slow paths (reclaim) and the real work done should
> just be larger - at least that's my intuition which might be quite off
> here. How much is that 25% of the system time in the total time btw?

The workload is a very read-heavy workload doing a lot of I/O.
Basically all the threads are going into direct reclaim a *LOT*.

The workload was roughly 50/50 user/kernel.



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 18:06       ` Tim Chen
@ 2019-10-14 18:31         ` Michal Hocko
  2019-10-14 22:14           ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-10-14 18:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm, Andrew Morton

On Mon 14-10-19 11:06:24, Tim Chen wrote:
> On 10/14/19 10:59 AM, Michal Hocko wrote:
> > On Mon 14-10-19 10:49:49, Dave Hansen wrote:
> >> On 10/14/19 10:37 AM, Michal Hocko wrote:
> >>>> for_each_possible_cpu(cpu)
> >>>> 	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> >>>>
> >>>> It is costly looping through all the cpus to get the lru vec size info.
> >>>> And doing this on our workload with 96 cpu threads and 500 mem cgroups
> >>>> makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
> >>>> which is a lot of data structures to be running through all the time.
> >>> Why does the number of cgroup matter?
> >>
> >> I was thinking purely of the cache footprint.  If it's reading
> >> pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
> >> bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
> >> be 18k of data for the whole system and the caching would be pretty
> >> efficient and all 18k would probably survive a tight page fault loop in
> >> the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
> >> fit in the L1 and probably wouldn't survive a tight page fault loop if
> >> both logical threads were banging on different cgroups.
> >>
> >> It's just a theory, but it's why I noted the number of cgroups when I
> >> initially saw this show up in profiles.
> > 
> > Yes, the cache traffic might be really high but I still find it a bit
> > surprising that it makes such a large footprint because this should be
> > mostly called from slow paths (reclaim) and the real work done should
> > just be larger - at least that's my intuition which might be quite off
> > here. How much is that 25% of the system time in the total time btw?
> > 
> 
> About 7% of total cpu cycles.

OK, 7% on a highly direct reclaim bound workload sounds more realistic.
Please put all that into the changelog. It would also be great to see
whether that really scales with the number of cgroups, if that is easy to
check with your benchmark.
-- 
Michal Hocko
SUSE Labs



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 18:31         ` Michal Hocko
@ 2019-10-14 22:14           ` Andrew Morton
  2019-10-15  6:19             ` Michal Hocko
  2019-10-15 18:23             ` Tim Chen
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2019-10-14 22:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm

On Mon, 14 Oct 2019 20:31:07 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> Please put all that into the changelog. It would also be great to see
> whether that really scales with the number of cgroups, if that is easy to
> check with your benchmark.

I restored this, with some changelog additions.

Tim, please send along any updates you'd like to make to this, along
with tested-by, acked-by, etc.

I added the Fixes: tag.  Do we think this is serious enough to warrant
backporting into -stable trees?  I suspect so, as any older-tree
maintainer who spies this change will say "hell yeah".


From: Honglei Wang <honglei.wang@oracle.com>
Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size

lruvec_lru_size() is invoking lruvec_page_state_local() to get the
lru_size.  It's based on lruvec_stat_local.count[] of mem_cgroup_per_node.
This counter is updated in a batched way.  It won't be charged if the
number of incoming pages doesn't reach MEMCG_CHARGE_BATCH, which is
defined as 32.

The testcase in LTP madvise09[1] fails because small blocks of memory are
not charged.  It creates a new memcgroup and sets up 32 MADV_FREE pages.
Then it forks a child that introduces memory pressure in the memcgroup.
The MADV_FREE pages are expected to be released under the pressure, but
32 is not more than MEMCG_CHARGE_BATCH and these pages won't be charged in
lruvec_stat_local.count[] until some more pages come in to satisfy the
needs of batch charging.  So these MADV_FREE pages can't be freed under
memory pressure, which conflicts with the definition of MADV_FREE.

Getting the lru_size based on lru_zone_size of mem_cgroup_per_node, which
is not updated via batching, makes it more accurate in this scenario.

This is effectively a partial reversion of 1a61ab8038e72 ("mm: memcontrol:
replace zone summing with lruvec_page_state()").

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c

Tim said:

: We were running a database benchmark in a mem cgroup and found that
: lruvec_lru_size is taking up a huge chunk of CPU cycles (about 25% of our
: kernel time - about 7% of total cpu cycles) on the 5.3 kernel.
: 
: The main issue is the loop in lruvec_page_state_local called by
: lruvec_lru_size in the mem cgroup path:
: 
: for_each_possible_cpu(cpu) x += per_cpu(pn->lruvec_stat_local->count[idx],
: cpu); It is costly looping through all the cpus to get the lru vec size
: info.  And doing this on our workload with 96 cpu threads and 500 mem
: cgroups makes things much worse.  We might end up having 96 cpus * 500
: cgroups * 2 (main) LRUs pagevecs, which is a lot of data structures to be
: running through all the time.
: 
: Honglei's patch restores the previous method for computing lru_size and is
: much more efficient at getting the lru_size.  We got a 20% throughput
: improvement in our database benchmark with Honglei's patch, and
: lruvec_lru_size's cpu overhead completely disappeared from the cpu
: profile.

Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
+++ a/mm/vmscan.c
@@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
  */
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
 {
-	unsigned long lru_size;
+	unsigned long lru_size = 0;
 	int zid;
 
-	if (!mem_cgroup_disabled())
-		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
-	else
+	if (!mem_cgroup_disabled()) {
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+	} else
 		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 
 	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
_
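
For context, mem_cgroup_get_zone_lru_size() used above boils down to a direct
read of the per-zone LRU counts that the LRU add/remove paths already maintain.
Roughly (going from memory rather than quoting include/linux/memcontrol.h
verbatim):

static inline unsigned long
mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
	struct mem_cgroup_per_node *mz;

	/* lru_zone_size[] is updated as pages are added to/removed from the LRU */
	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	return mz->lru_zone_size[zone_idx][lru];
}

So the memcg path becomes a handful of array reads per call instead of a walk
over every possible CPU.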




* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 22:14           ` Andrew Morton
@ 2019-10-15  6:19             ` Michal Hocko
  2019-10-15 20:38               ` Andrew Morton
  2019-10-15 18:23             ` Tim Chen
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-10-15  6:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm

On Mon 14-10-19 15:14:30, Andrew Morton wrote:
[...]
> From: Honglei Wang <honglei.wang@oracle.com>
> Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
> 
> lruvec_lru_size() is invoking lruvec_page_state_local() to get the
> lru_size.  It's based on lruvec_stat_local.count[] of mem_cgroup_per_node.
> This counter is updated in a batched way.  It won't be charged if the
> number of incoming pages doesn't reach MEMCG_CHARGE_BATCH, which is
> defined as 32.
> 
> The testcase in LTP madvise09[1] fails because small blocks of memory are
> not charged.  It creates a new memcgroup and sets up 32 MADV_FREE pages.
> Then it forks a child that introduces memory pressure in the memcgroup.
> The MADV_FREE pages are expected to be released under the pressure, but
> 32 is not more than MEMCG_CHARGE_BATCH and these pages won't be charged in
> lruvec_stat_local.count[] until some more pages come in to satisfy the
> needs of batch charging.  So these MADV_FREE pages can't be freed under
> memory pressure, which conflicts with the definition of MADV_FREE.
> 
> Getting the lru_size based on lru_zone_size of mem_cgroup_per_node, which
> is not updated via batching, makes it more accurate in this scenario.
> 
> This is effectively a partial reversion of 1a61ab8038e72 ("mm: memcontrol:
> replace zone summing with lruvec_page_state()").
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
> 
> Tim said:
> 
> : We were running a database benchmark in a mem cgroup and found that
> : lruvec_lru_size is taking up a huge chunk of CPU cycles (about 25% of our
> : kernel time - about 7% of total cpu cycles) on the 5.3 kernel.
> : 
> : The main issue is the loop in lruvec_page_state_local called by
> : lruvec_lru_size in the mem cgroup path:
> : 
> : for_each_possible_cpu(cpu) x += per_cpu(pn->lruvec_stat_local->count[idx],
> : cpu); It is costly looping through all the cpus to get the lru vec size
> : info.  And doing this on our workload with 96 cpu threads and 500 mem
> : cgroups makes things much worse.  We might end up having 96 cpus * 500
> : cgroups * 2 (main) LRUs pagevecs, which is a lot of data structures to be
> : running through all the time.
> : 
> : Honglei's patch restores the previous method for computing lru_size and is
> : much more efficient at getting the lru_size.  We got a 20% throughput
> : improvement in our database benchmark with Honglei's patch, and
> : lruvec_lru_size's cpu overhead completely disappeared from the cpu
> : profile.

I dunno, but squashing those two changelogs sounds more confusing than
helpful to me. What about the following instead?
"
1a61ab8038e72 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") made lruvec_page_state() use per-cpu counters
instead of calculating it directly from lru_zone_size, with the idea that
this would be more effective. Tim has reported that this is not really
the case for their database benchmark, which shows the opposite result:
lruvec_page_state is taking up a huge chunk of CPU cycles
(about 25% of the system time, which is roughly 7% of total cpu cycles)
on 5.3 kernels. The workload is running on a larger machine (96 CPUs),
it has many cgroups (500) and it is heavily direct reclaim bound.

<Tim's perf profile goes here>

The likely culprit is the cache traffic the lruvec_page_state_local
generates. Dave Hansen says:
: I was thinking purely of the cache footprint.  If it's reading
: pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
: bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
: 
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles

Fix the regression by partially reverting the said commit and calculating
the lru size explicitly.
"

> Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
> Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
> Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
> Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/vmscan.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
> +++ a/mm/vmscan.c
> @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> _

-- 
Michal Hocko
SUSE Labs



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-14 22:14           ` Andrew Morton
  2019-10-15  6:19             ` Michal Hocko
@ 2019-10-15 18:23             ` Tim Chen
  1 sibling, 0 replies; 12+ messages in thread
From: Tim Chen @ 2019-10-15 18:23 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm

On 10/14/19 3:14 PM, Andrew Morton wrote:
> On Mon, 14 Oct 2019 20:31:07 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
>> Please put all that into the changelog. It would also be great to see
>> whether that really scales with the number of cgroups, if that is easy to
>> check with your benchmark.
> 
> I restored this, with some changelog additions.
> 
> Tim, please send along any updates you'd like to make to this, along
> with tested-by, acked-by, etc.
> 
> I added the Fixes: tag.  Do we think this is serious enough to warrant
> backporting into -stable trees?  I suspect so, as any older-tree
> maintainer who spies this change will say "hell yeah".

Johannes' changes that introduced the sum over all percpu counters
in lruvec_lru_size only got merged recently, in 5.2. I tested the 5.1 kernel and
it does not have this particular scalability issue.

So a backport is only needed for 5.2 stable.

Feel free to add 

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>

Thanks.

Tim



* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-15  6:19             ` Michal Hocko
@ 2019-10-15 20:38               ` Andrew Morton
  2019-10-16  7:25                 ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2019-10-15 20:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm

On Tue, 15 Oct 2019 08:19:20 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> I dunno, but squashing those two changelogs sounds more confusing than
> helpful to me. What about the following instead?



From: Honglei Wang <honglei.wang@oracle.com>
Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size

1a61ab8038e72 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") made lruvec_page_state() use per-cpu counters
instead of calculating it directly from lru_zone_size, with the idea that
this would be more effective.  Tim has reported that this is not really
the case for their database benchmark, which shows the opposite result:
lruvec_page_state is taking up a huge chunk of CPU cycles (about 25%
of the system time, which is roughly 7% of total cpu cycles) on 5.3
kernels.  The workload is running on a larger machine (96 CPUs), it has
many cgroups (500) and it is heavily direct reclaim bound.

Tim Chen said:

: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
: 
: 
: 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
:             |
:             |--3.07%--lruvec_lru_size
:             |          |
:             |          |--2.11%--cpumask_next
:             |          |          |
:             |          |           --1.66%--find_next_bit
:             |          |
:             |           --0.57%--call_function_interrupt
:             |                     |
:             |                      --0.55%--smp_call_function_interrupt
:             |
:             |--1.59%--0x441f0fc3d009
:             |          _ops_rdtsc_init_base_freq
:             |          access_histogram
:             |          page_fault
:             |          __do_page_fault
:             |          handle_mm_fault
:             |          __handle_mm_fault
:             |          |
:             |           --1.54%--do_swap_page
:             |                     swapin_readahead
:             |                     swap_cluster_readahead
:             |                     |
:             |                      --1.53%--read_swap_cache_async
:             |                                __read_swap_cache_async
:             |                                alloc_pages_vma
:             |                                __alloc_pages_nodemask
:             |                                __alloc_pages_slowpath
:             |                                try_to_free_pages
:             |                                do_try_to_free_pages
:             |                                shrink_node
:             |                                shrink_node_memcg
:             |                                |
:             |                                |--0.77%--lruvec_lru_size
:             |                                |
:             |                                 --0.76%--inactive_list_is_low
:             |                                           |
:             |                                            --0.76%--lruvec_lru_size
:             |
:              --1.50%--measure_read
:                        page_fault
:                        __do_page_fault
:                        handle_mm_fault
:                        __handle_mm_fault
:                        do_swap_page
:                        swapin_readahead
:                        swap_cluster_readahead
:                        |
:                         --1.48%--read_swap_cache_async
:                                   __read_swap_cache_async
:                                   alloc_pages_vma
:                                   __alloc_pages_nodemask
:                                   __alloc_pages_slowpath
:                                   try_to_free_pages
:                                   do_try_to_free_pages
:                                   shrink_node
:                                   shrink_node_memcg
:                                   |
:                                   |--0.75%--inactive_list_is_low
:                                   |          |
:                                   |           --0.75%--lruvec_lru_size
:                                   |
:                                    --0.73%--lruvec_lru_size


The likely culprit is the cache traffic the lruvec_page_state_local
generates. Dave Hansen says:

: I was thinking purely of the cache footprint.  If it's reading
: pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
: bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
: 
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles

Fix the regression by partially reverting the said commit and calculating
the lru size explicitly.

Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
+++ a/mm/vmscan.c
@@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
  */
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
 {
-	unsigned long lru_size;
+	unsigned long lru_size = 0;
 	int zid;
 
-	if (!mem_cgroup_disabled())
-		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
-	else
+	if (!mem_cgroup_disabled()) {
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+	} else
 		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 
 	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
_




* Re: memcgroup lruvec_lru_size scaling issue
  2019-10-15 20:38               ` Andrew Morton
@ 2019-10-16  7:25                 ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2019-10-16  7:25 UTC (permalink / raw)
  To: Tim Chen, Andrew Morton
  Cc: Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm

On Tue 15-10-19 13:38:31, Andrew Morton wrote:
> On Tue, 15 Oct 2019 08:19:20 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > I dunno, but squashing those two changelogs sounds more confusing than
> > helpful to me. What about the following instead?
> 
> 
> 
> From: Honglei Wang <honglei.wang@oracle.com>
> Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
> 
> 1a61ab8038e72 ("mm: memcontrol: replace zone summing with
> lruvec_page_state()") made lruvec_page_state() use per-cpu counters
> instead of calculating it directly from lru_zone_size, with the idea that
> this would be more effective.  Tim has reported that this is not really
> the case for their database benchmark, which shows the opposite result:
> lruvec_page_state is taking up a huge chunk of CPU cycles (about 25%
> of the system time, which is roughly 7% of total cpu cycles) on 5.3
> kernels.  The workload is running on a larger machine (96 CPUs), it has
> many cgroups (500) and it is heavily direct reclaim bound.
> 
> Tim Chen said:
> 
> : The problem can also be reproduced by running simple multi-threaded
> : pmbench benchmark with a fast Optane SSD swap (see profile below).
> : 
> : 
> : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
> :             |
> :             |--3.07%--lruvec_lru_size
> :             |          |
> :             |          |--2.11%--cpumask_next
> :             |          |          |
> :             |          |           --1.66%--find_next_bit
> :             |          |
> :             |           --0.57%--call_function_interrupt
> :             |                     |
> :             |                      --0.55%--smp_call_function_interrupt
> :             |
> :             |--1.59%--0x441f0fc3d009
> :             |          _ops_rdtsc_init_base_freq
> :             |          access_histogram
> :             |          page_fault
> :             |          __do_page_fault
> :             |          handle_mm_fault
> :             |          __handle_mm_fault
> :             |          |
> :             |           --1.54%--do_swap_page
> :             |                     swapin_readahead
> :             |                     swap_cluster_readahead
> :             |                     |
> :             |                      --1.53%--read_swap_cache_async
> :             |                                __read_swap_cache_async
> :             |                                alloc_pages_vma
> :             |                                __alloc_pages_nodemask
> :             |                                __alloc_pages_slowpath
> :             |                                try_to_free_pages
> :             |                                do_try_to_free_pages
> :             |                                shrink_node
> :             |                                shrink_node_memcg
> :             |                                |
> :             |                                |--0.77%--lruvec_lru_size
> :             |                                |
> :             |                                 --0.76%--inactive_list_is_low
> :             |                                           |
> :             |                                            --0.76%--lruvec_lru_size
> :             |
> :              --1.50%--measure_read
> :                        page_fault
> :                        __do_page_fault
> :                        handle_mm_fault
> :                        __handle_mm_fault
> :                        do_swap_page
> :                        swapin_readahead
> :                        swap_cluster_readahead
> :                        |
> :                         --1.48%--read_swap_cache_async
> :                                   __read_swap_cache_async
> :                                   alloc_pages_vma
> :                                   __alloc_pages_nodemask
> :                                   __alloc_pages_slowpath
> :                                   try_to_free_pages
> :                                   do_try_to_free_pages
> :                                   shrink_node
> :                                   shrink_node_memcg
> :                                   |
> :                                   |--0.75%--inactive_list_is_low
> :                                   |          |
> :                                   |           --0.75%--lruvec_lru_size
> :                                   |
> :                                    --0.73%--lruvec_lru_size
> 
> 
> The likely culprit is the cache traffic the lruvec_page_state_local
> generates. Dave Hansen says:
> 
> : I was thinking purely of the cache footprint.  If it's reading
> : pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
> : bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
> : be 18k of data for the whole system and the caching would be pretty
> : efficient and all 18k would probably survive a tight page fault loop in
> : the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
> : fit in the L1 and probably wouldn't survive a tight page fault loop if
> : both logical threads were banging on different cgroups.
> : 
> : It's just a theory, but it's why I noted the number of cgroups when I
> : initially saw this show up in profiles

Btw. that theory could be confirmed by an increased number of cache
misses IIUC. Tim, could you give it a try please?
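
Something along these lines, run before and after the patch, should show it
(reuse whatever pmbench invocation produced the profile above):

	perf stat -a -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
		-- <pmbench command line>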

> 
> Fix the regression by partially reverting the said commit and calculating
> the lru size explicitly.
> 
> Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
> Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
> Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
> Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
> Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
> Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: <stable@vger.kernel.org>	[5.2+]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 
>  mm/vmscan.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
> +++ a/mm/vmscan.c
> @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> _

-- 
Michal Hocko
SUSE Labs


