All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org, cgroups@vger.kernel.org,
	Ingo Molnar <mingo@redhat.com>,
	Andrew Morton <akpm@linuxfoundation.org>,
	Tejun Heo <tj@kernel.org>, Balbir Singh <bsingharora@gmail.com>,
	Mike Galbraith <efault@gmx.de>, Oliver Yang <yangoliver@me.com>,
	Shakeel Butt <shakeelb@google.com>, xxx xxx <x.qendo@gmail.com>,
	Taras Kondratiuk <takondra@cisco.com>,
	Daniel Walker <danielwa@cisco.com>,
	Vinayak Menon <vinmenon@codeaurora.org>,
	Ruslan Ruslichenko <rruslich@cisco.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO
Date: Thu, 10 May 2018 09:41:32 -0400	[thread overview]
Message-ID: <20180510134132.GA19348@cmpxchg.org> (raw)
In-Reply-To: <20180509113849.GJ12235@hirez.programming.kicks-ass.net>

On Wed, May 09, 2018 at 01:38:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> > On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > 
> > > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > >  	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > >  	if (task_cpu(p) != cpu) {
> > >  		wake_flags |= WF_MIGRATED;
> > > +		psi_ttwu_dequeue(p);
> > >  		set_task_cpu(p, cpu);
> > >  	}
> > >  
> > 
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > +	/*
> > > +	 * Is the task being migrated during a wakeup? Make sure to
> > > +	 * deregister its sleep-persistent psi states from the old
> > > +	 * queue, and let psi_enqueue() know it has to requeue.
> > > +	 */
> > > +	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > +		struct rq_flags rf;
> > > +		struct rq *rq;
> > > +		int clear = 0;
> > > +
> > > +		if (p->in_iowait)
> > > +			clear |= TSK_IOWAIT;
> > > +		if (p->flags & PF_MEMSTALL)
> > > +			clear |= TSK_MEMSTALL;
> > > +
> > > +		rq = __task_rq_lock(p, &rf);
> > > +		update_rq_clock(rq);
> > > +		psi_task_change(p, rq_clock(rq), clear, 0);
> > > +		p->sched_psi_wake_requeue = 1;
> > > +		__task_rq_unlock(rq, &rf);
> > > +	}
> > > +}
> > 
> > Yeah, no... not happening.
> > 
> > We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> > was the one pushing for that, so he should very well know this.
> > 
> > The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> > accounting being useless makes it even worse), adding significant remote
> > prodding is just really bad.
> 
> Also, since all you need is the global number, I don't think you
> actually need any of this. See what we do for nr_uninterruptible.
> 
> In general I think you want to (re)read loadavg.c some more, and maybe
> reuse a bit more of that.

So there is a reason I'm tracking productivity states per-cpu and not
globally. Consider the following example periods on two CPUs:

    CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

    CPU 1
Task 3: | memstalled | EXECUTING  |

If we tracked only the global number of stalled tasks, similarly to
nr_uninterruptible, the number would be elevated throughout the whole
sampling period, giving a pressure value of 100% for "some stalled".
And, since there is always something executing, a "full stall" of 0%.

Now consider what happens when the Task 3 sequence is the other way
around:

    CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

    CPU 1
Task 3: | EXECUTING  | memstalled |

Here the number of stalled tasks is elevated only during half of the
sampling period, this time giving a pressure reading of 50% for "some"
(and again 0% for "full").

That's a different measurement, but in terms of workload progress, the
sequences are functionally equivalent. In both scenarios the same
amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
and the same amount of potentially productive CPU time is lost due to
the contention of memory. We really ought to read the same pressure.

So what I'm doing is calculating the productivity loss on each CPU in
a sampling period as if they were independent time slices. It doesn't
matter how you slice and dice the sequences within each one - if used
CPU time and lost CPU time have the same proportion, we have the same
pressure.

In both scenarios above, this method will give a pressure reading of
some=50% and full=25% of "normalized walltime", which is the time loss
the work would experience on a single CPU executing it serially.

To illustrate:

    CPU X
        1            2            3            4
Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |

You can clearly see the 50% of walltime in which *somebody* isn't
advancing (2 and 4), and the 25% of walltime in which *no* tasks are
(3). Same amount of work, same memory stalls, same pressure numbers.

Globalized state tracking would produce those numbers on the single
CPU (obviously), but once concurrency gets into the mix, it's
questionable what its results mean. It certainly isn't able to
reliably detect equivalent slowdowns of individual tasks ("some" is
all over the place), and in this example wasn't able to capture the
impact of contention on overall work completion ("full" is 0%).

* CPU 0: some = 50%, full =  0%
  CPU 1: some = 50%, full = 50%
    avg: some = 50%, full = 25%

  reply	other threads:[~2018-05-10 13:39 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-07 21:01 [PATCH 0/7] " Johannes Weiner
2018-05-07 21:01 ` [PATCH 1/7] mm: workingset: don't drop refault information prematurely Johannes Weiner
2018-05-07 21:01 ` [PATCH 2/7] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
2018-05-07 21:01 ` [PATCH 3/7] delayacct: track delays from thrashing cache pages Johannes Weiner
2018-05-07 21:01 ` [PATCH 4/7] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
2018-05-07 21:01 ` [PATCH 5/7] sched: loadavg: make calc_load_n() public Johannes Weiner
2018-05-09  9:49   ` Peter Zijlstra
2018-05-10 13:46     ` Johannes Weiner
2018-05-07 21:01 ` [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
2018-05-08  0:42   ` Randy Dunlap
2018-05-08 14:06     ` Johannes Weiner
2018-05-08  1:35   ` kbuild test robot
2018-05-08  3:04   ` kbuild test robot
2018-05-08 14:05     ` Johannes Weiner
2018-05-09  9:59   ` Peter Zijlstra
2018-05-10 13:49     ` Johannes Weiner
2018-05-09 10:04   ` Peter Zijlstra
2018-05-10 14:10     ` Johannes Weiner
2018-05-09 10:05   ` Peter Zijlstra
2018-05-10 14:13     ` Johannes Weiner
2018-05-09 10:14   ` Peter Zijlstra
2018-05-10 14:18     ` Johannes Weiner
2018-05-09 10:21   ` Peter Zijlstra
2018-05-10 14:24     ` Johannes Weiner
2018-05-09 10:26   ` Peter Zijlstra
2018-05-09 10:46   ` Peter Zijlstra
2018-05-09 11:38     ` Peter Zijlstra
2018-05-10 13:41       ` Johannes Weiner [this message]
2018-05-14  8:33         ` Peter Zijlstra
2018-05-09 10:55   ` Peter Zijlstra
2018-05-09 11:03   ` Vinayak Menon
2018-05-23 13:17     ` Johannes Weiner
2018-05-23 13:19       ` Vinayak Menon
2018-06-07  0:46   ` Suren Baghdasaryan
2018-05-07 21:01 ` [PATCH 7/7] psi: cgroup support Johannes Weiner
2018-05-09 11:07   ` Peter Zijlstra
2018-05-10 14:49     ` Johannes Weiner
2018-05-14 15:39 ` [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO Christopher Lameter
2018-05-14 17:35   ` Bart Van Assche
2018-05-14 18:55   ` Johannes Weiner
2018-05-14 20:15     ` Christopher Lameter
2018-05-26  0:29 ` Suren Baghdasaryan
2018-05-29 18:16   ` Johannes Weiner
2018-05-30 23:32     ` Suren Baghdasaryan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180510134132.GA19348@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linuxfoundation.org \
    --cc=bsingharora@gmail.com \
    --cc=cgroups@vger.kernel.org \
    --cc=danielwa@cisco.com \
    --cc=efault@gmx.de \
    --cc=kernel-team@fb.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rruslich@cisco.com \
    --cc=shakeelb@google.com \
    --cc=takondra@cisco.com \
    --cc=tj@kernel.org \
    --cc=vinmenon@codeaurora.org \
    --cc=x.qendo@gmail.com \
    --cc=yangoliver@me.com \
    --subject='Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.