Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: 王贇 <yun.wang@linux.alibaba.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Iurii Zaikin <yzaikin@google.com>,
	Michal Koutn? <mkoutny@suse.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org,
	"Paul E. McKenney" <paulmck@linux.ibm.com>
Subject: Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
Date: Thu, 28 Nov 2019 10:09:13 +0800
Message-ID: <3ff78d18-fa29-13f3-81e5-a05537a2e344@linux.alibaba.com> (raw)
In-Reply-To: <20191127101932.GN28938@suse.de>

On 2019/11/27 下午6:19, Mel Gorman wrote:
> On Wed, Nov 27, 2019 at 09:49:34AM +0800, ?????? wrote:
>> Currently there are no good approach to monitoring the per-cgroup
>> numa efficiency, this could be a trouble especially when groups
>> are sharing CPUs, it's impossible to tell which one caused the
>> remote-memory access by reading hardware counter since multiple
>> workloads could sharing the same CPU, which make it painful when
>> one want to find out the root cause and fix the issue>>
> 
> It's already possible to identify specific tasks triggering PMU events
> so this is not exactly true.

Should fix the description regarding this...

I think you mean tools like numatop which showing per task local/remote
accessing info from PMU, correct?

It's a good one for debugging, but when we talking about monitoring over
cluster sharing by multiple users, still not very practical... compared
to the workloads classified historical data.

I'm not sure about the overhead and limitation of this PMU approach, or
whether there are any platform it's not yet supported, worth a survey.

> 
>> In order to address this, we introduced new per-cgroup statistic
>> for numa:
>>   * the numa locality to imply the numa balancing efficiency
>>   * the numa execution time on each node
>>
[snip]
>> +#ifdef CONFIG_PROC_SYSCTL
>> +int sysctl_cg_numa_stat(struct ctl_table *table, int write,
>> +			 void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table t;
>> +	int err;
>> +	int state = static_branch_likely(&sched_cg_numa_stat);
>> +
>> +	if (write && !capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	t = *table;
>> +	t.data = &state;
>> +	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
>> +	if (err < 0 || !write)
>> +		return err;
>> +
>> +	if (state)
>> +		static_branch_enable(&sched_cg_numa_stat);
>> +	else
>> +		static_branch_disable(&sched_cg_numa_stat);
>> +
>> +	return err;
>> +}
>> +#endif
>> +
> 
> Why is this implemented as a toggle? I'm finding it hard to make sense
> of this. The numa_stat should not even exist if the feature is disabled.

numa_stat will not exist if CONFIG is not enabled, do you mean it should
also disappear when dynamically turn off?

> 
> Assuming that is fixed then the runtime overhead is fine but the same
> issues with the quality of the information relying on NUMA balancing
> limits the usefulness of this. Disabling NUMA balancing or the scan rate
> dropping to a very low frequency would lead in misleading conclusions as
> well as false positives if the CPU and memory policies force remote memory
> usage. Similarly, the timing of the information available is variable du
> to how numa_faults_locality gets reset so sometimes the information is
> fine-grained and sometimes it's coarse grained. It will also pretend to
> display useful information even if NUMA balancing is disabled.

The data just represent what we traced on NUMA balancing PF, so yes folks
need some understanding on NUMA balancing to figure out the real meaning
behind locality.

We want it to tell the real story, if NUMA balancing disabled or scan rate
dropped very low, the locality increments should be very small, when it
keep failing for memory policy or CPU binding reason, we want it to tell how
bad it is, locality just show us how NUMA Balancing is performing, the data
could contains many information since how OS dealing with NUMA could be
complicated...

> 
> I find it hard to believe it would be useful in practice and I think users
> would have real trouble interpreting the data given how much it relies on
> internal implementation details of NUMA balancing. I cannot be certain
> as clearly something motivated the creation of this patch although it's
> unclear if it has ever been used to debug and fix an actual problem in
> the field. Hence, I'm neutral on the patch and will neither ack or nack
> it and will defer to the scheduler maintainers but if I was pushed on it,
> I would be disinclined to merge the patch due to the potential confusion
> caused by users who believe it provides accurate information when at best
> it gives a rough approximation with variable granularity.

We have our cluster enabled this feature already, an old version though but
still helpful, when we want to debug NUMA issues, this could give good hints.

Consider it as load_1/5/15 which not accurate but tell the trend of system
behavior, locality giving the trend of NUMA Balancing as long as it's working,
when increasing slowly it means locality already good enough or no more memory
to adjust, and that's fine, for those who disabled the NUMA Balancing, they do
their own NUMA optimization and find their own ways to estimate the results.

Anyway, we thanks for all those good inputs from your side :-)

Regards,
Michael Wang


  reply index

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced " 王贇
2019-11-13  3:44 ` [PATCH 1/3] sched/numa: advanced per-cgroup " 王贇
2019-11-13  3:45 ` [PATCH 2/3] sched/numa: expose per-task pages-migration-failure 王贇
2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
2019-11-13 15:09   ` Jonathan Corbet
2019-11-14  1:52     ` 王贇
2019-11-13 18:28   ` Iurii Zaikin
2019-11-14  2:22     ` 王贇
2019-11-15  2:29   ` [PATCH v2 " 王贇
2019-11-20  9:45 ` [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
2019-11-25  1:35 ` 王贇
2019-11-27  1:48 ` [PATCH v2 " 王贇
2019-11-27  1:49   ` [PATCH v2 1/3] sched/numa: advanced per-cgroup " 王贇
2019-11-27 10:19     ` Mel Gorman
2019-11-28  2:09       ` 王贇 [this message]
2019-11-28 12:39         ` Michal Koutný
2019-11-28 13:41           ` 王贇
2019-11-28 15:58             ` Michal Koutný
2019-11-29  1:52               ` 王贇
2019-11-29  5:19                 ` 王贇
2019-11-29 10:06                   ` Michal Koutný
2019-12-02  2:11                     ` 王贇
2019-11-27  1:50   ` [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure 王贇
2019-11-27 10:00     ` Mel Gorman
2019-12-02  2:22     ` 王贇
2019-11-27  1:50   ` [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
2019-11-27  4:58     ` Randy Dunlap
2019-11-27  5:54       ` 王贇
2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
2019-12-03  6:00     ` [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2019-12-04  2:33       ` Randy Dunlap
2019-12-04  2:38         ` 王贇
2019-12-03  6:02     ` [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
2019-12-03 13:43       ` Jonathan Corbet
2019-12-04  2:27         ` 王贇
2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
2019-12-04  7:59       ` [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2019-12-05  3:28         ` Randy Dunlap
2019-12-05  3:29           ` Randy Dunlap
2019-12-05  3:52             ` 王贇
2019-12-04  8:00       ` [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
2019-12-05  3:40         ` Randy Dunlap
2019-12-05  6:53       ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
2019-12-05  6:53         ` [PATCH v5 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2019-12-05  6:54         ` [PATCH v5 2/2] sched/numa: documentation for per-cgroup numa, statistics 王贇
2019-12-10  2:19         ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
2019-12-13  1:43         ` [PATCH v6 " 王贇
2019-12-13  1:47           ` [PATCH v6 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2020-01-03 15:14             ` Michal Koutný
2020-01-04  4:51               ` 王贇
2019-12-13  1:48           ` [PATCH v6 2/2] sched/numa: documentation for per-cgroup numa 王贇
2019-12-27  2:22           ` [PATCH v6 0/2] sched/numa: introduce numa locality 王贇
2020-01-17  2:19           ` 王贇
2020-01-19  6:08           ` [PATCH v7 " 王贇
2020-01-19  6:09             ` [PATCH v7 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2020-01-19  6:09             ` [PATCH v7 2/2] sched/numa: documentation for per-cgroup numa, statistics 王贇
2020-01-21  0:12               ` Randy Dunlap
2020-01-21  1:58                 ` 王贇
2020-01-21  1:56             ` [PATCH v8 0/2] sched/numa: introduce numa locality 王贇
2020-01-21  1:57               ` [PATCH v8 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2020-01-21  1:57               ` [PATCH v8 2/2] sched/numa: documentation for per-cgroup numa, statistics 王贇
2020-01-21  2:08                 ` Randy Dunlap

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3ff78d18-fa29-13f3-81e5-a05537a2e344@linux.alibaba.com \
    --to=yun.wang@linux.alibaba.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=keescook@chromium.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=paulmck@linux.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=yzaikin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git