Re: Sum of weights idea for CFS PI

From: Qais Yousef <qais.yousef@arm.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	Youssef Esmat <youssefesmat@google.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: Sum of weights idea for CFS PI
Date: Fri, 30 Sep 2022 14:49:31 +0100	[thread overview]
Message-ID: <20220930134931.mpopdvri4xuponw2@wubuntu> (raw)
In-Reply-To: <cb6c406e-1431-fcfd-ef82-87259760ead9@joelfernandes.org>

Hi Joel

I'm interested in the topic, if I can be CCed in any future discussions I'd
appreciate it :)

On 09/29/22 16:38, Joel Fernandes wrote:
> Hi Peter, all,
> 
> Just following-up about the idea Peter suggested at LPC22 about sum of weights
> to solve the CFS priority inversion issues using priority inheritance. I am not
> sure if a straight forward summation of the weights of dependencies in the
> chain, is sufficient (or may cause too much unfairness).
> 
> I think it will work if all the tasks on CPU are 100% in utilization:
> 
> Say if you have 4 tasks (A, B, C, D) running and each one has equal
> weight (W) except for A which has twice the weight (2W).
> So the CPU bandwidth distribution is (assuming all are running):
> A:   2 / 5
> B, C. D:  1 / 5
> 
> Say out of the 4 tasks, 3 of them are a part of a classical priority
> inversion scenario (A, B and C).
> 
> Say now A blocks on a lock and that lock's owner C is running, however now
> because A has blocked, B gets 1/3 bandwidth, where as it should have been
> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
> bandwidth - still not fair since B is eating away CPU bandwidth causing the
> priority inversion we want to remedy.
> 
> The correct bandwidth distribution should be (B and D should be unchanged):
> B = 1/5
> D = 1/5
> 
> C = 3/5
> 
> This means that C's weight should be 3W , and B and D should be W each
> as before. So indeed, C's new weight is its original weight PLUS the
> weight of the A - that's needed to keep the CPU usage of the other
> tasks (B, D) in check so that C makes forward progress on behalf of A and the
> other tasks don't eat into the CPU utilization.
> 
> However, I think this will kinda fall apart if A is asleep 50% of the time
> (assume the sleep is because of I/O and unrelated to the PI chain).
> 
> Because now if all were running (and assume no PI dependencies), with A being
> 50%, the bandwidth of B, C and D each would be divided into 2 components:
> 
> a.  when A is running, it would be as above.
> b.  but if A was sleeping, B, C, and D would get 1/3.
> 
> So on average, B, C and D get:  (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
> or 1/5 bandwidth.

The average metric is interesting one. It can be confusing to reason about too.

I think we have 3 events to take into account here, not 2:

a. when A is running and NOT blocked on C.
b. when A is running and BLOCKED on C.
c. A is sleeping.

This means A, B, C and D's shares will be:

    A ,  B ,  C ,  D
a. 2/5, 1/5, 1/5, 1/5
b. -  , 3/5, 1/5, 1/5
c. -  , 1/3, 1/3, 1/3

Since A is sleeping for 50%, I don't think we can assume equal distribution for
the 3 events (can't just divide by 3).

I believe we can assume that

a. occurs 25% of the time
b. occurs 25% of the time
c. occurs 50% of the time

I *think* this should provide something more representative.

> 
> But now say A happen to block on a lock that C is holding. You would boost C to
> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
> C should actually get.
> 
> C should get (8/30 + 6/30 = 14/30) AFAICS.
> 
> Hopefully one can see that a straight summation of weights is not enough. It
> needs to be something like:
> 
> C's new weight = C's original weight + (A's weight) * (A's utilization)
> 
> Or something, otherwise the inherited weight may be too much to properly solve it.
> 
> Any thoughts on this? You mentioned you had some notes on this and/or proxy
> execution, could you share it?

I assume we'll be using rt-mutex inheritance property to handle this? If this
was discussed during a talk, I'd appreciate a link to that.

In the past in OSPM conference we brought up an issue with performance
inversion where a task running on a smaller (slower to be more generic) CPU is
holding the lock and causing massive delays for waiters. This is an artefact of
DVFS. For HMP, there's an additional cause due to the unequal capacities of the
CPUs.

Proxy execution seems to be the nice solution to all of these problems, but
it's a long way away. I'm interested to learn how this inheritance will be
implemented. And whether there are any userspace conversion issues. i.e: do
we need to convert all locks to rt-mutex locks?

Thanks

--
Qais Yousef