linux-mm.kvack.org archive mirror
From: Yosry Ahmed <yosryahmed@google.com>
To: Feng Tang <feng.tang@intel.com>
Cc: "Shakeel Butt" <shakeelb@google.com>,
	"Sang, Oliver" <oliver.sang@intel.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	lkp <lkp@intel.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Ivan Babrou" <ivan@cloudflare.com>, "Tejun Heo" <tj@kernel.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Waiman Long" <longman@redhat.com>,
	"kernel-team@cloudflare.com" <kernel-team@cloudflare.com>,
	"Wei Xu" <weixugc@google.com>, "Greg Thelen" <gthelen@google.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
Date: Mon, 23 Oct 2023 19:13:50 -0700
Message-ID: <CAJD7tkYXJ3vcGvteNH98tB_C7OTo718XSxL=mFsUa7kO8vzFzA@mail.gmail.com>
In-Reply-To: <CAJD7tka5UnHBz=eX1LtynAjJ+O_oredMKBBL3kFNfG7PHjuMCw@mail.gmail.com>

On Mon, Oct 23, 2023 at 11:25 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > > > >
> > > > >
> > > > > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > > > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > > > > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > > > > patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
> > > > > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> > > > >
> > > > > testcase: will-it-scale
> > > > > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > > > > parameters:
> > > > >
> > > > >         nr_task: 100%
> > > > >         mode: thread
> > > > >         test: fallocate1
> > > > >         cpufreq_governor: performance
> > > > >
> > > > >
> > > > > In addition to that, the commit also has significant impact on the following tests:
> > > > >
> > > > > +------------------+---------------------------------------------------------------+
> > > > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> > > > > | test machine     | 104 threads 2 sockets (Skylake) with 192G memory              |
> > > > > | test parameters  | cpufreq_governor=performance                                  |
> > > > > |                  | mode=thread                                                   |
> > > > > |                  | nr_task=50%                                                   |
> > > > > |                  | test=fallocate1                                               |
> > > > > +------------------+---------------------------------------------------------------+
> > > > >
> > > >
> > > > Yosry, I don't think a 25% to 30% regression can be ignored. Unless
> > > > there is a quick fix, IMO this series should be skipped for the
> > > > upcoming kernel merge window.
> > >
> > > I am currently looking into it. It's reasonable to skip the next merge
> > > window if a quick fix isn't found soon.
> > >
> > > I am surprised by the size of the regression given the following:
> > >       1.12 ±  5%      +1.4        2.50 ±  2%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> > >
> > > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
> >
> > Yes, this is kind of confusing. And we have seen similar cases before,
> > especially for micro-benchmarks like will-it-scale, stress-ng, netperf
> > etc.: a change to functions in the hot path is greatly amplified
> > in the final benchmark score.
> >
> > In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> > the affected functions had around a 10% change in perf's cpu-cycles,
> > and triggered a 69% regression. IIRC, micro-benchmarks are very sensitive
> > to such statistics updates, like memcg's and vmstat's.
> >
>
> Thanks for clarifying. I am still trying to reproduce locally but I am
> running into some quirks with tooling. I may have to run a modified
> version of the fallocate test manually. Meanwhile, I noticed that the
> patch was tested without the fixlet that I posted [1] for it,
> understandably. Would it be possible to get some numbers with that
> fixlet? It should reduce the total number of contended atomic
> operations, so it may help.
>
> [1] https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
>
> I am also wondering if aligning the stats_updates atomic will help.
> Right now it may share a cacheline with some items of the
> events_pending array. The latter may be dirtied during a flush and
> unnecessarily dirty the former, but the chances are slim to be honest.
> If it's easy to test such a diff, that would be nice, but I don't
> expect a lot of difference:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..a35fce653262 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -646,7 +646,7 @@ struct memcg_vmstats {
>         unsigned long           events_pending[NR_MEMCG_EVENTS];
>
>         /* Stats updates since the last flush */
> -       atomic64_t              stats_updates;
> +       atomic64_t              stats_updates ____cacheline_aligned_in_smp;
>  };
>
>  /*

I still could not run the benchmark, but I used a version of
fallocate1.c that does 1 million iterations. I ran 100 in parallel.
This showed ~13% regression with the patch, so not the same as the
will-it-scale version, but it could be an indicator.
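
(Illustrative sketch, not the program used for the numbers above.) A
bounded loop of this kind could look roughly like the following. It
assumes the test body preallocates and then truncates an unlinked
temporary file; the real fallocate1.c may differ. The function name,
the /tmp path and FILESIZE are made up for the example, and
posix_fallocate() is swapped in for fallocate(2) only so the sketch
builds without _GNU_SOURCE (on Linux, glibc normally issues the same
system call for it).

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define FILESIZE (1024 * 1024)	/* assumed allocation size per iteration */

/*
 * Preallocate and truncate a temporary file 'iterations' times.
 * Returns the number of completed iterations, or -1 on any failure.
 */
long fallocate_loop(long iterations)
{
	char path[] = "/tmp/fallocate-sketch.XXXXXX";	/* hypothetical location */
	long done = 0;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the file stays usable until fd is closed */

	for (; done < iterations; done++) {
		/* posix_fallocate() returns 0 on success, an error number otherwise */
		if (posix_fallocate(fd, 0, FILESIZE) != 0 ||
		    ftruncate(fd, 0) != 0) {
			done = -1;
			break;
		}
	}

	close(fd);
	return done;
}
```

Calling fallocate_loop(1000000) from many processes or threads at once,
all inside the same memcg, would approximate the manual run described
above.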

With that, I did not see any improvement with the fixlet above or
____cacheline_aligned_in_smp. So you can scratch that.
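
(Illustrative sketch, not kernel code.) The alignment idea from the
quoted diff can be shown in userspace with C11 alignas standing in for
____cacheline_aligned_in_smp. The 64-byte line size, NR_EVENTS and the
struct and function names are assumptions made for the example only.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumed cache line size */
#define NR_EVENTS 20	/* hypothetical array length */

/* Counter placed right after the array: it can share the array's last line. */
struct stats_packed {
	unsigned long events_pending[NR_EVENTS];
	int64_t stats_updates;
};

/* Counter forced onto its own cache line boundary. */
struct stats_aligned {
	unsigned long events_pending[NR_EVENTS];
	alignas(CACHELINE) int64_t stats_updates;
};

static_assert(offsetof(struct stats_aligned, stats_updates) % CACHELINE == 0,
	      "aligned counter must start on a cache line boundary");

/* Nonzero if two byte offsets land in the same line of a line-aligned struct. */
static int same_cacheline(size_t a, size_t b)
{
	return a / CACHELINE == b / CACHELINE;
}

/* Does the counter share a line with the last events_pending element? */
int packed_shares_line(void)
{
	size_t last = offsetof(struct stats_packed, events_pending) +
		      (NR_EVENTS - 1) * sizeof(unsigned long);

	return same_cacheline(last, offsetof(struct stats_packed, stats_updates));
}

int aligned_shares_line(void)
{
	size_t last = offsetof(struct stats_aligned, events_pending) +
		      (NR_EVENTS - 1) * sizeof(unsigned long);

	return same_cacheline(last, offsetof(struct stats_aligned, stats_updates));
}
```

This only demonstrates the layout effect; as noted above, it made no
measurable difference in the manual test.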

I did, however, see some improvement with reducing the indirection
layers by moving stats_updates directly into struct mem_cgroup. The
regression in my manual testing went down to 9%. Still not great, but
I am wondering how this reflects on the benchmark. If you're able to
test it, that would be great; the diff is below. Meanwhile, I am still
looking for other improvements that can be made.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083e..b4dfcd8b9cc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {

        CACHELINE_PADDING(_pad1_);

+       /* Stats updates since the last flush */
+       atomic64_t              stats_updates;
+
        /* memory.stat */
        struct memcg_vmstats    *vmstats;

@@ -309,6 +312,7 @@ struct mem_cgroup {
        atomic_t                moving_account;
        struct task_struct      *move_lock_task;

+       unsigned int __percpu *stats_updates_percpu;
        struct memcg_vmstats_percpu __percpu *vmstats_percpu;

 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..e5d2f3d4d874 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
        /* Cgroup1: threshold notifications & softlimit tree updates */
        unsigned long           nr_page_events;
        unsigned long           targets[MEM_CGROUP_NTARGETS];
-
-       /* Stats updates since the last flush */
-       unsigned int            stats_updates;
 };

 struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
        /* Pending child counts during tree propagation */
        long                    state_pending[MEMCG_NR_STAT];
        unsigned long           events_pending[NR_MEMCG_EVENTS];
-
-       /* Stats updates since the last flush */
-       atomic64_t              stats_updates;
 };

 /*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)

 static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
 {
-       return atomic64_read(&memcg->vmstats->stats_updates) >
+       return atomic64_read(&memcg->stats_updates) >
                MEMCG_CHARGE_BATCH * num_online_cpus();
 }

 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
        int cpu = smp_processor_id();
-       unsigned int x;
+       unsigned int *stats_updates_percpu;

        if (!val)
                return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        cgroup_rstat_updated(memcg->css.cgroup, cpu);

        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-               x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
-                                         abs(val));
+               stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);

-               if (x < MEMCG_CHARGE_BATCH)
+               *stats_updates_percpu += abs(val);
+               if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
                        continue;

                /*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
                 * redundant. Avoid the overhead of the atomic update.
                 */
                if (!memcg_should_flush_stats(memcg))
-                       atomic64_add(x, &memcg->vmstats->stats_updates);
-               __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+                       atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+               *stats_updates_percpu = 0;
        }
 }

@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
                free_mem_cgroup_per_node_info(memcg, node);
        kfree(memcg->vmstats);
        free_percpu(memcg->vmstats_percpu);
+       free_percpu(memcg->stats_updates_percpu);
        kfree(memcg);
 }

@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
        if (!memcg->vmstats_percpu)
                goto fail;

+       memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+                                                      GFP_KERNEL_ACCOUNT);
+       if (!memcg->stats_updates_percpu)
+               goto fail;
+
        for_each_node(node)
                if (alloc_mem_cgroup_per_node_info(memcg, node))
                        goto fail;
@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
        struct mem_cgroup *parent = parent_mem_cgroup(memcg);
        struct memcg_vmstats_percpu *statc;
+       unsigned int *stats_updates_percpu;
        long delta, delta_cpu, v;
        int i, nid;

        statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+       stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);

        for (i = 0; i < MEMCG_NR_STAT; i++) {
                /*
@@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
                        }
                }
        }
-       statc->stats_updates = 0;
+       *stats_updates_percpu = 0;
        /* We are in a per-cpu loop here, only do the atomic write once */
-       if (atomic64_read(&memcg->vmstats->stats_updates))
-               atomic64_set(&memcg->vmstats->stats_updates, 0);
+       if (atomic64_read(&memcg->stats_updates))
+               atomic64_set(&memcg->stats_updates, 0);
 }

 #ifdef CONFIG_MMU
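
(Illustrative sketch, not part of the patch.) The update path in the
diff above can be modeled in single-threaded userspace C to make the
batching visible: each update is added to a per-CPU counter, and only
once that counter reaches the batch size is it folded into the shared
atomic, unless the shared counter already exceeds the flush threshold.
struct group, NR_CPUS, CHARGE_BATCH and the function names are
stand-ins invented for this model, and the constant values are
assumptions. A plain array replaces real per-CPU data.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_CPUS      4	/* hypothetical CPU count */
#define CHARGE_BATCH 64	/* stand-in for MEMCG_CHARGE_BATCH, value assumed */

struct group {
	struct group *parent;
	atomic_long stats_updates;		/* shared, contended counter */
	unsigned int percpu_updates[NR_CPUS];	/* models the per-CPU counters */
};

/* Mirrors the threshold check: batch size times the number of CPUs. */
static bool should_flush(struct group *g)
{
	return atomic_load(&g->stats_updates) > (long)CHARGE_BATCH * NR_CPUS;
}

/* Record an update of magnitude |val| made on 'cpu', walking up to the root. */
void rstat_updated(struct group *g, int cpu, int val)
{
	if (!val)
		return;

	for (; g; g = g->parent) {
		unsigned int *pending = &g->percpu_updates[cpu];

		*pending += abs(val);
		if (*pending < CHARGE_BATCH)
			continue;

		/*
		 * Once the group is already past the flush threshold, more
		 * shared updates are redundant, so skip the atomic add.
		 */
		if (!should_flush(g))
			atomic_fetch_add(&g->stats_updates, *pending);
		*pending = 0;
	}
}
```

In this model the shared counter is touched at most once per
CHARGE_BATCH units of change per CPU per level. The model does not
capture pointer indirection or cache behavior, which is what moving
stats_updates into struct mem_cgroup is meant to address.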


