All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Subject: Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
Date: Mon, 31 Jul 2017 14:41:42 -0400	[thread overview]
Message-ID: <20170731184142.GA30943@cmpxchg.org> (raw)
In-Reply-To: <20170731083111.tgjgkwge5dgt5m2e@hirez.programming.kicks-ass.net>

On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:
> On Sun, Jul 30, 2017 at 11:28:13AM -0400, Johannes Weiner wrote:
> > On Sat, Jul 29, 2017 at 11:10:55AM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> > > > +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> > > > +			      int old, int new)
> > > > +{
> > > > +	enum memdelay_domain_state state;
> > > > +	struct memdelay_domain_cpu *mdc;
> > > > +	unsigned long now, delta;
> > > > +	unsigned long flags;
> > > > +
> > > > +	mdc = per_cpu_ptr(md->mdcs, cpu);
> > > > +	spin_lock_irqsave(&mdc->lock, flags);
> > > 
> > > Afaict this is inside scheduler locks, this cannot be a spinlock. Also,
> > > do we really want to add more atomics there?
> > 
> > I think we should be able to get away without an additional lock and
> > rely on the rq lock instead. schedule, enqueue, dequeue already hold
> > it, memdelay_enter/leave could be added. I need to think about what to
> > do with try_to_wake_up in order to get the cpu move accounting inside
> > the locked section of ttwu_queue(), but that should be doable too.
> 
> So could you start by describing what actual statistics we need? Because
> as is the scheduler already does a gazillion stats and why can't re
> repurpose some of those?

If that's possible, that would be great of course.

We want to be able to tell how many tasks in a domain (the system or a
memory cgroup) are inside a memdelay section as opposed to how many
are in a "productive" state such as runnable or iowait. Then derive
from that whether the domain as a whole is unproductive (all non-idle
tasks memdelayed), or partially unproductive (some delayed, but CPUs
are productive or there are iowait tasks). Then derive the percentages
of walltime the domain spends partially or fully unproductive.

For that we need per-domain counters for

	1) nr of tasks in memdelay sections
	2) nr of iowait or runnable/queued tasks that are NOT inside
	   memdelay sections

The memdelay and runnable counts need to be per-cpu as well. (The idea
is this: if you have one CPU and some tasks are delayed while others
are runnable, you're 100% partially productive, as the CPU is fully
used. But if you have two CPUs, and the tasks on one CPU are all
runnable while the tasks on the others are all delayed, the domain is
50% of the time fully unproductive (and not 100% partially productive)
as half the available CPU time is being squandered by delays).

On the system-level, we already count runnable/queued per cpu through
rq->nr_running.

However, we need to distinguish between productive runnables and tasks
that are in runnable while in a memdelay section (doing reclaim). The
current counters don't do that.

Lastly, and somewhat obscurely, the presence of runnable tasks means
that usually the domain is at least partially productive. But if the
CPU is used by a task in a memdelay section (direct reclaim), the
domain is fully unproductive (unless there are iowait tasks in the
domain, since they make "progress" without CPU). So we need to track
task_current() && task_memdelayed() per-domain per-cpu as well.

Now, thinking only about the system-level, we could split
rq->nr_running into a sets of delayed and non-delayed counters
(present them as sum in all current read sides).

Adding an rq counter for tasks inside memdelay sections should be
straight-forward as well (except for maybe the migration cost of that
state between CPUs in ttwu that Mike pointed out).

That leaves the question of how to track these numbers per cgroup at
an acceptable cost. The idea for a tree of cgroups is that walltime
impact of delays at each level is reported for all tasks at or below
that level. E.g. a leave group aggregates the state of its own tasks,
the root/system aggregates the state of all tasks in the system; hence
the propagation of the task state counters up the hierarchy.

WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Subject: Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
Date: Mon, 31 Jul 2017 14:41:42 -0400	[thread overview]
Message-ID: <20170731184142.GA30943@cmpxchg.org> (raw)
In-Reply-To: <20170731083111.tgjgkwge5dgt5m2e@hirez.programming.kicks-ass.net>

On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:
> On Sun, Jul 30, 2017 at 11:28:13AM -0400, Johannes Weiner wrote:
> > On Sat, Jul 29, 2017 at 11:10:55AM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> > > > +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> > > > +			      int old, int new)
> > > > +{
> > > > +	enum memdelay_domain_state state;
> > > > +	struct memdelay_domain_cpu *mdc;
> > > > +	unsigned long now, delta;
> > > > +	unsigned long flags;
> > > > +
> > > > +	mdc = per_cpu_ptr(md->mdcs, cpu);
> > > > +	spin_lock_irqsave(&mdc->lock, flags);
> > > 
> > > Afaict this is inside scheduler locks, this cannot be a spinlock. Also,
> > > do we really want to add more atomics there?
> > 
> > I think we should be able to get away without an additional lock and
> > rely on the rq lock instead. schedule, enqueue, dequeue already hold
> > it, memdelay_enter/leave could be added. I need to think about what to
> > do with try_to_wake_up in order to get the cpu move accounting inside
> > the locked section of ttwu_queue(), but that should be doable too.
> 
> So could you start by describing what actual statistics we need? Because
> as is the scheduler already does a gazillion stats and why can't re
> repurpose some of those?

If that's possible, that would be great of course.

We want to be able to tell how many tasks in a domain (the system or a
memory cgroup) are inside a memdelay section as opposed to how many
are in a "productive" state such as runnable or iowait. Then derive
from that whether the domain as a whole is unproductive (all non-idle
tasks memdelayed), or partially unproductive (some delayed, but CPUs
are productive or there are iowait tasks). Then derive the percentages
of walltime the domain spends partially or fully unproductive.

For that we need per-domain counters for

	1) nr of tasks in memdelay sections
	2) nr of iowait or runnable/queued tasks that are NOT inside
	   memdelay sections

The memdelay and runnable counts need to be per-cpu as well. (The idea
is this: if you have one CPU and some tasks are delayed while others
are runnable, you're 100% partially productive, as the CPU is fully
used. But if you have two CPUs, and the tasks on one CPU are all
runnable while the tasks on the others are all delayed, the domain is
50% of the time fully unproductive (and not 100% partially productive)
as half the available CPU time is being squandered by delays).

On the system-level, we already count runnable/queued per cpu through
rq->nr_running.

However, we need to distinguish between productive runnables and tasks
that are in runnable while in a memdelay section (doing reclaim). The
current counters don't do that.

Lastly, and somewhat obscurely, the presence of runnable tasks means
that usually the domain is at least partially productive. But if the
CPU is used by a task in a memdelay section (direct reclaim), the
domain is fully unproductive (unless there are iowait tasks in the
domain, since they make "progress" without CPU). So we need to track
task_current() && task_memdelayed() per-domain per-cpu as well.

Now, thinking only about the system-level, we could split
rq->nr_running into a sets of delayed and non-delayed counters
(present them as sum in all current read sides).

Adding an rq counter for tasks inside memdelay sections should be
straight-forward as well (except for maybe the migration cost of that
state between CPUs in ttwu that Mike pointed out).

That leaves the question of how to track these numbers per cgroup at
an acceptable cost. The idea for a tree of cgroups is that walltime
impact of delays at each level is reported for all tasks at or below
that level. E.g. a leave group aggregates the state of its own tasks,
the root/system aggregates the state of all tasks in the system; hence
the propagation of the task state counters up the hierarchy.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-07-31 18:42 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-27 15:30 [PATCH 0/3] memdelay: memory health metric for systems and workloads Johannes Weiner
2017-07-27 15:30 ` Johannes Weiner
2017-07-27 15:30 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros Johannes Weiner
2017-07-27 15:30   ` Johannes Weiner
2017-07-27 15:30 ` [PATCH 2/3] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
2017-07-27 15:30   ` Johannes Weiner
2017-07-27 15:30 ` [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads Johannes Weiner
2017-07-27 15:30   ` Johannes Weiner
2017-07-27 15:56   ` Johannes Weiner
2017-07-27 15:56     ` Johannes Weiner
2017-07-29  9:10   ` Peter Zijlstra
2017-07-29  9:10     ` Peter Zijlstra
2017-07-30 15:28     ` Johannes Weiner
2017-07-30 15:28       ` Johannes Weiner
2017-07-31  8:31       ` Peter Zijlstra
2017-07-31  8:31         ` Peter Zijlstra
2017-07-31 18:41         ` Johannes Weiner [this message]
2017-07-31 18:41           ` Johannes Weiner
2017-07-31 19:49           ` Mike Galbraith
2017-07-31 19:49             ` Mike Galbraith
2017-07-31 20:38             ` Johannes Weiner
2017-07-31 20:38               ` Johannes Weiner
2017-08-01  2:23               ` Mike Galbraith
2017-08-01  2:23                 ` Mike Galbraith
2017-08-01  7:57           ` Peter Zijlstra
2017-08-01  7:57             ` Peter Zijlstra
2017-08-01 12:26             ` Johannes Weiner
2017-08-01 12:26               ` Johannes Weiner
2017-08-13 14:52               ` Peter Zijlstra
2017-08-13 14:52                 ` Peter Zijlstra
2017-07-29 13:31   ` kbuild test robot
2017-07-27 20:43 ` [PATCH 0/3] memdelay: memory health metric " Andrew Morton
2017-07-27 20:43   ` Andrew Morton
2017-07-28 19:43   ` Johannes Weiner
2017-07-28 19:43     ` Johannes Weiner
2017-08-02  8:11     ` Michal Hocko
2017-08-02  8:11       ` Michal Hocko
2017-07-29  2:48 ` Mike Galbraith
2017-07-29  2:48   ` Mike Galbraith
2017-07-29  3:21   ` Mike Galbraith
2017-07-29  3:21     ` Mike Galbraith
2017-07-29  6:38   ` Mike Galbraith
2017-07-29  6:38     ` Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170731184142.GA30943@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.