* Sum of weights idea for CFS PI @ 2022-09-29 20:38 Joel Fernandes 2022-09-30 13:49 ` Qais Yousef 0 siblings, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-09-29 20:38 UTC (permalink / raw) To: Peter Zijlstra Cc: LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner Hi Peter, all, Just following-up about the idea Peter suggested at LPC22 about sum of weights to solve the CFS priority inversion issues using priority inheritance. I am not sure if a straight forward summation of the weights of dependencies in the chain, is sufficient (or may cause too much unfairness). I think it will work if all the tasks on CPU are 100% in utilization: Say if you have 4 tasks (A, B, C, D) running and each one has equal weight (W) except for A which has twice the weight (2W). So the CPU bandwidth distribution is (assuming all are running): A: 2 / 5 B, C. D: 1 / 5 Say out of the 4 tasks, 3 of them are a part of a classical priority inversion scenario (A, B and C). Say now A blocks on a lock and that lock's owner C is running, however now because A has blocked, B gets 1/3 bandwidth, where as it should have been limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 bandwidth - still not fair since B is eating away CPU bandwidth causing the priority inversion we want to remedy. The correct bandwidth distribution should be (B and D should be unchanged): B = 1/5 D = 1/5 C = 3/5 This means that C's weight should be 3W , and B and D should be W each as before. So indeed, C's new weight is its original weight PLUS the weight of the A - that's needed to keep the CPU usage of the other tasks (B, D) in check so that C makes forward progress on behalf of A and the other tasks don't eat into the CPU utilization. However, I think this will kinda fall apart if A is asleep 50% of the time (assume the sleep is because of I/O and unrelated to the PI chain). Because now if all were running (and assume no PI dependencies), with A being 50%, the bandwidth of B, C and D each would be divided into 2 components: a. when A is running, it would be as above. b. but if A was sleeping, B, C, and D would get 1/3. So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 or 1/5 bandwidth. But now say A happen to block on a lock that C is holding. You would boost C to weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what C should actually get. C should get (8/30 + 6/30 = 14/30) AFAICS. Hopefully one can see that a straight summation of weights is not enough. It needs to be something like: C's new weight = C's original weight + (A's weight) * (A's utilization) Or something, otherwise the inherited weight may be too much to properly solve it. Any thoughts on this? You mentioned you had some notes on this and/or proxy execution, could you share it? Thanks, - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-29 20:38 Sum of weights idea for CFS PI Joel Fernandes @ 2022-09-30 13:49 ` Qais Yousef 2022-09-30 15:44 ` Youssef Esmat 2022-09-30 17:34 ` Joel Fernandes 0 siblings, 2 replies; 22+ messages in thread From: Qais Yousef @ 2022-09-30 13:49 UTC (permalink / raw) To: Joel Fernandes Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner Hi Joel I'm interested in the topic, if I can be CCed in any future discussions I'd appreciate it :) On 09/29/22 16:38, Joel Fernandes wrote: > Hi Peter, all, > > Just following-up about the idea Peter suggested at LPC22 about sum of weights > to solve the CFS priority inversion issues using priority inheritance. I am not > sure if a straight forward summation of the weights of dependencies in the > chain, is sufficient (or may cause too much unfairness). > > I think it will work if all the tasks on CPU are 100% in utilization: > > Say if you have 4 tasks (A, B, C, D) running and each one has equal > weight (W) except for A which has twice the weight (2W). > So the CPU bandwidth distribution is (assuming all are running): > A: 2 / 5 > B, C. D: 1 / 5 > > Say out of the 4 tasks, 3 of them are a part of a classical priority > inversion scenario (A, B and C). > > Say now A blocks on a lock and that lock's owner C is running, however now > because A has blocked, B gets 1/3 bandwidth, where as it should have been > limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 > bandwidth - still not fair since B is eating away CPU bandwidth causing the > priority inversion we want to remedy. > > The correct bandwidth distribution should be (B and D should be unchanged): > B = 1/5 > D = 1/5 > > C = 3/5 > > This means that C's weight should be 3W , and B and D should be W each > as before. So indeed, C's new weight is its original weight PLUS the > weight of the A - that's needed to keep the CPU usage of the other > tasks (B, D) in check so that C makes forward progress on behalf of A and the > other tasks don't eat into the CPU utilization. > > However, I think this will kinda fall apart if A is asleep 50% of the time > (assume the sleep is because of I/O and unrelated to the PI chain). > > Because now if all were running (and assume no PI dependencies), with A being > 50%, the bandwidth of B, C and D each would be divided into 2 components: > > a. when A is running, it would be as above. > b. but if A was sleeping, B, C, and D would get 1/3. > > So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 > or 1/5 bandwidth. The average metric is interesting one. It can be confusing to reason about too. I think we have 3 events to take into account here, not 2: a. when A is running and NOT blocked on C. b. when A is running and BLOCKED on C. c. A is sleeping. This means A, B, C and D's shares will be: A , B , C , D a. 2/5, 1/5, 1/5, 1/5 b. - , 3/5, 1/5, 1/5 c. - , 1/3, 1/3, 1/3 Since A is sleeping for 50%, I don't think we can assume equal distribution for the 3 events (can't just divide by 3). I believe we can assume that a. occurs 25% of the time b. occurs 25% of the time c. occurs 50% of the time I *think* this should provide something more representative. > > But now say A happen to block on a lock that C is holding. You would boost C to > weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what > C should actually get. > > C should get (8/30 + 6/30 = 14/30) AFAICS. > > Hopefully one can see that a straight summation of weights is not enough. It > needs to be something like: > > C's new weight = C's original weight + (A's weight) * (A's utilization) > > Or something, otherwise the inherited weight may be too much to properly solve it. > > Any thoughts on this? You mentioned you had some notes on this and/or proxy > execution, could you share it? I assume we'll be using rt-mutex inheritance property to handle this? If this was discussed during a talk, I'd appreciate a link to that. In the past in OSPM conference we brought up an issue with performance inversion where a task running on a smaller (slower to be more generic) CPU is holding the lock and causing massive delays for waiters. This is an artefact of DVFS. For HMP, there's an additional cause due to the unequal capacities of the CPUs. Proxy execution seems to be the nice solution to all of these problems, but it's a long way away. I'm interested to learn how this inheritance will be implemented. And whether there are any userspace conversion issues. i.e: do we need to convert all locks to rt-mutex locks? Thanks -- Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 13:49 ` Qais Yousef @ 2022-09-30 15:44 ` Youssef Esmat 2022-09-30 17:42 ` Joel Fernandes 2022-09-30 17:34 ` Joel Fernandes 1 sibling, 1 reply; 22+ messages in thread From: Youssef Esmat @ 2022-09-30 15:44 UTC (permalink / raw) To: Qais Yousef Cc: Joel Fernandes, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner Hi Everyone! I am not sure we should care about A's sleeping pattern. The case we care about is when A is running or wants to run but can't because it is blocked on C. In that case C should get the weight of A as if A was running. Ideally this is also a temporary boost since critical sections should be relatively small, so erring on the side of giving C slightly more runtime would be safe I think. Thanks, Youssef On Fri, Sep 30, 2022 at 8:49 AM Qais Yousef <qais.yousef@arm.com> wrote: > > Hi Joel > > I'm interested in the topic, if I can be CCed in any future discussions I'd > appreciate it :) > > On 09/29/22 16:38, Joel Fernandes wrote: > > Hi Peter, all, > > > > Just following-up about the idea Peter suggested at LPC22 about sum of weights > > to solve the CFS priority inversion issues using priority inheritance. I am not > > sure if a straight forward summation of the weights of dependencies in the > > chain, is sufficient (or may cause too much unfairness). > > > > I think it will work if all the tasks on CPU are 100% in utilization: > > > > Say if you have 4 tasks (A, B, C, D) running and each one has equal > > weight (W) except for A which has twice the weight (2W). > > So the CPU bandwidth distribution is (assuming all are running): > > A: 2 / 5 > > B, C. D: 1 / 5 > > > > Say out of the 4 tasks, 3 of them are a part of a classical priority > > inversion scenario (A, B and C). > > > > Say now A blocks on a lock and that lock's owner C is running, however now > > because A has blocked, B gets 1/3 bandwidth, where as it should have been > > limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 > > bandwidth - still not fair since B is eating away CPU bandwidth causing the > > priority inversion we want to remedy. > > > > The correct bandwidth distribution should be (B and D should be unchanged): > > B = 1/5 > > D = 1/5 > > > > C = 3/5 > > > > This means that C's weight should be 3W , and B and D should be W each > > as before. So indeed, C's new weight is its original weight PLUS the > > weight of the A - that's needed to keep the CPU usage of the other > > tasks (B, D) in check so that C makes forward progress on behalf of A and the > > other tasks don't eat into the CPU utilization. > > > > However, I think this will kinda fall apart if A is asleep 50% of the time > > (assume the sleep is because of I/O and unrelated to the PI chain). > > > > Because now if all were running (and assume no PI dependencies), with A being > > 50%, the bandwidth of B, C and D each would be divided into 2 components: > > > > a. when A is running, it would be as above. > > b. but if A was sleeping, B, C, and D would get 1/3. > > > > So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 > > or 1/5 bandwidth. > > The average metric is interesting one. It can be confusing to reason about too. > > I think we have 3 events to take into account here, not 2: > > a. when A is running and NOT blocked on C. > b. when A is running and BLOCKED on C. > c. A is sleeping. > > This means A, B, C and D's shares will be: > > A , B , C , D > a. 2/5, 1/5, 1/5, 1/5 > b. - , 3/5, 1/5, 1/5 > c. - , 1/3, 1/3, 1/3 > > Since A is sleeping for 50%, I don't think we can assume equal distribution for > the 3 events (can't just divide by 3). > > I believe we can assume that > > a. occurs 25% of the time > b. occurs 25% of the time > c. occurs 50% of the time > > I *think* this should provide something more representative. > > > > > But now say A happen to block on a lock that C is holding. You would boost C to > > weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what > > C should actually get. > > > > C should get (8/30 + 6/30 = 14/30) AFAICS. > > > > Hopefully one can see that a straight summation of weights is not enough. It > > needs to be something like: > > > > C's new weight = C's original weight + (A's weight) * (A's utilization) > > > > Or something, otherwise the inherited weight may be too much to properly solve it. > > > > Any thoughts on this? You mentioned you had some notes on this and/or proxy > > execution, could you share it? > > I assume we'll be using rt-mutex inheritance property to handle this? If this > was discussed during a talk, I'd appreciate a link to that. > > In the past in OSPM conference we brought up an issue with performance > inversion where a task running on a smaller (slower to be more generic) CPU is > holding the lock and causing massive delays for waiters. This is an artefact of > DVFS. For HMP, there's an additional cause due to the unequal capacities of the > CPUs. > > Proxy execution seems to be the nice solution to all of these problems, but > it's a long way away. I'm interested to learn how this inheritance will be > implemented. And whether there are any userspace conversion issues. i.e: do > we need to convert all locks to rt-mutex locks? > > > Thanks > > -- > Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 15:44 ` Youssef Esmat @ 2022-09-30 17:42 ` Joel Fernandes 2022-09-30 18:10 ` Youssef Esmat 0 siblings, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-09-30 17:42 UTC (permalink / raw) To: Youssef Esmat, Qais Yousef Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 9/30/2022 11:44 AM, Youssef Esmat wrote: > Hi Everyone! Hi Youssef, (Youssef is new to LKML though in no way new to OS or software development. I gave him the usual 'dont-top-post' chat already - fyi). > I am not sure we should care about A's sleeping pattern. The case we > care about is when A is running or wants to run but can't because it > is blocked on C. In that case C should get the weight of A as if A was > running. Just to clarify - Youssef did mean sum of weights of different things in the chain, and not just weights (he confirmed on chat that that's what he meant). > Ideally this is also a temporary boost since critical sections should > be relatively small, so erring on the side of giving C slightly more > runtime would be safe I think. True. But I would not hold my breath too much on user space not holding a lock for very long periods of time. But I agree that generally should be true. thanks, - Joel > > Thanks, > Youssef > > On Fri, Sep 30, 2022 at 8:49 AM Qais Yousef <qais.yousef@arm.com> wrote: >> >> Hi Joel >> >> I'm interested in the topic, if I can be CCed in any future discussions I'd >> appreciate it :) >> >> On 09/29/22 16:38, Joel Fernandes wrote: >>> Hi Peter, all, >>> >>> Just following-up about the idea Peter suggested at LPC22 about sum of weights >>> to solve the CFS priority inversion issues using priority inheritance. I am not >>> sure if a straight forward summation of the weights of dependencies in the >>> chain, is sufficient (or may cause too much unfairness). >>> >>> I think it will work if all the tasks on CPU are 100% in utilization: >>> >>> Say if you have 4 tasks (A, B, C, D) running and each one has equal >>> weight (W) except for A which has twice the weight (2W). >>> So the CPU bandwidth distribution is (assuming all are running): >>> A: 2 / 5 >>> B, C. D: 1 / 5 >>> >>> Say out of the 4 tasks, 3 of them are a part of a classical priority >>> inversion scenario (A, B and C). >>> >>> Say now A blocks on a lock and that lock's owner C is running, however now >>> because A has blocked, B gets 1/3 bandwidth, where as it should have been >>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 >>> bandwidth - still not fair since B is eating away CPU bandwidth causing the >>> priority inversion we want to remedy. >>> >>> The correct bandwidth distribution should be (B and D should be unchanged): >>> B = 1/5 >>> D = 1/5 >>> >>> C = 3/5 >>> >>> This means that C's weight should be 3W , and B and D should be W each >>> as before. So indeed, C's new weight is its original weight PLUS the >>> weight of the A - that's needed to keep the CPU usage of the other >>> tasks (B, D) in check so that C makes forward progress on behalf of A and the >>> other tasks don't eat into the CPU utilization. >>> >>> However, I think this will kinda fall apart if A is asleep 50% of the time >>> (assume the sleep is because of I/O and unrelated to the PI chain). >>> >>> Because now if all were running (and assume no PI dependencies), with A being >>> 50%, the bandwidth of B, C and D each would be divided into 2 components: >>> >>> a. when A is running, it would be as above. >>> b. but if A was sleeping, B, C, and D would get 1/3. >>> >>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 >>> or 1/5 bandwidth. >> >> The average metric is interesting one. It can be confusing to reason about too. >> >> I think we have 3 events to take into account here, not 2: >> >> a. when A is running and NOT blocked on C. >> b. when A is running and BLOCKED on C. >> c. A is sleeping. >> >> This means A, B, C and D's shares will be: >> >> A , B , C , D >> a. 2/5, 1/5, 1/5, 1/5 >> b. - , 3/5, 1/5, 1/5 >> c. - , 1/3, 1/3, 1/3 >> >> Since A is sleeping for 50%, I don't think we can assume equal distribution for >> the 3 events (can't just divide by 3). >> >> I believe we can assume that >> >> a. occurs 25% of the time >> b. occurs 25% of the time >> c. occurs 50% of the time >> >> I *think* this should provide something more representative. >> >>> >>> But now say A happen to block on a lock that C is holding. You would boost C to >>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what >>> C should actually get. >>> >>> C should get (8/30 + 6/30 = 14/30) AFAICS. >>> >>> Hopefully one can see that a straight summation of weights is not enough. It >>> needs to be something like: >>> >>> C's new weight = C's original weight + (A's weight) * (A's utilization) >>> >>> Or something, otherwise the inherited weight may be too much to properly solve it. >>> >>> Any thoughts on this? You mentioned you had some notes on this and/or proxy >>> execution, could you share it? >> >> I assume we'll be using rt-mutex inheritance property to handle this? If this >> was discussed during a talk, I'd appreciate a link to that. >> >> In the past in OSPM conference we brought up an issue with performance >> inversion where a task running on a smaller (slower to be more generic) CPU is >> holding the lock and causing massive delays for waiters. This is an artefact of >> DVFS. For HMP, there's an additional cause due to the unequal capacities of the >> CPUs. >> >> Proxy execution seems to be the nice solution to all of these problems, but >> it's a long way away. I'm interested to learn how this inheritance will be >> implemented. And whether there are any userspace conversion issues. i.e: do >> we need to convert all locks to rt-mutex locks? >> >> >> Thanks >> >> -- >> Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 17:42 ` Joel Fernandes @ 2022-09-30 18:10 ` Youssef Esmat 2022-09-30 18:45 ` Joel Fernandes 0 siblings, 1 reply; 22+ messages in thread From: Youssef Esmat @ 2022-09-30 18:10 UTC (permalink / raw) To: Joel Fernandes Cc: Qais Yousef, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On Fri, Sep 30, 2022 at 12:42 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On 9/30/2022 11:44 AM, Youssef Esmat wrote: > > Hi Everyone! > > Hi Youssef, > > (Youssef is new to LKML though in no way new to OS or software development. I > gave him the usual 'dont-top-post' chat already - fyi). > > > I am not sure we should care about A's sleeping pattern. The case we > > care about is when A is running or wants to run but can't because it > > is blocked on C. In that case C should get the weight of A as if A was > > running. > > Just to clarify - Youssef did mean sum of weights of different things in the > chain, and not just weights (he confirmed on chat that that's what he meant). > Yeah thanks for clarifying, I meant that C should get the sum of weights as if A was running (3/5 in your example) since in this segment of time A would have been running if it was not blocked on the lock. I think it's safe to ignore the average and just use the sum of weights. So it would look like this: Time -> A: Slp, 2/5, Blk, 2/5, Slp B: 1/3, 1/5, 1/5, 1/5, 1/3 C: 1/3, 1/5, 3/5, 1/5, 1/3 D: 1/3, 1/5, 1/5, 1/5, 1/3 * Slp means sleeping * Blk means blocked on the lock owned by C. > > Ideally this is also a temporary boost since critical sections should > > be relatively small, so erring on the side of giving C slightly more > > runtime would be safe I think. > > True. But I would not hold my breath too much on user space not holding a lock > for very long periods of time. But I agree that generally should be true. > > thanks, > > - Joel > > > > > > Thanks, > > Youssef > > > > On Fri, Sep 30, 2022 at 8:49 AM Qais Yousef <qais.yousef@arm.com> wrote: > >> > >> Hi Joel > >> > >> I'm interested in the topic, if I can be CCed in any future discussions I'd > >> appreciate it :) > >> > >> On 09/29/22 16:38, Joel Fernandes wrote: > >>> Hi Peter, all, > >>> > >>> Just following-up about the idea Peter suggested at LPC22 about sum of weights > >>> to solve the CFS priority inversion issues using priority inheritance. I am not > >>> sure if a straight forward summation of the weights of dependencies in the > >>> chain, is sufficient (or may cause too much unfairness). > >>> > >>> I think it will work if all the tasks on CPU are 100% in utilization: > >>> > >>> Say if you have 4 tasks (A, B, C, D) running and each one has equal > >>> weight (W) except for A which has twice the weight (2W). > >>> So the CPU bandwidth distribution is (assuming all are running): > >>> A: 2 / 5 > >>> B, C. D: 1 / 5 > >>> > >>> Say out of the 4 tasks, 3 of them are a part of a classical priority > >>> inversion scenario (A, B and C). > >>> > >>> Say now A blocks on a lock and that lock's owner C is running, however now > >>> because A has blocked, B gets 1/3 bandwidth, where as it should have been > >>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 > >>> bandwidth - still not fair since B is eating away CPU bandwidth causing the > >>> priority inversion we want to remedy. > >>> > >>> The correct bandwidth distribution should be (B and D should be unchanged): > >>> B = 1/5 > >>> D = 1/5 > >>> > >>> C = 3/5 > >>> > >>> This means that C's weight should be 3W , and B and D should be W each > >>> as before. So indeed, C's new weight is its original weight PLUS the > >>> weight of the A - that's needed to keep the CPU usage of the other > >>> tasks (B, D) in check so that C makes forward progress on behalf of A and the > >>> other tasks don't eat into the CPU utilization. > >>> > >>> However, I think this will kinda fall apart if A is asleep 50% of the time > >>> (assume the sleep is because of I/O and unrelated to the PI chain). > >>> > >>> Because now if all were running (and assume no PI dependencies), with A being > >>> 50%, the bandwidth of B, C and D each would be divided into 2 components: > >>> > >>> a. when A is running, it would be as above. > >>> b. but if A was sleeping, B, C, and D would get 1/3. > >>> > >>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 > >>> or 1/5 bandwidth. > >> > >> The average metric is interesting one. It can be confusing to reason about too. > >> > >> I think we have 3 events to take into account here, not 2: > >> > >> a. when A is running and NOT blocked on C. > >> b. when A is running and BLOCKED on C. > >> c. A is sleeping. > >> > >> This means A, B, C and D's shares will be: > >> > >> A , B , C , D > >> a. 2/5, 1/5, 1/5, 1/5 > >> b. - , 3/5, 1/5, 1/5 > >> c. - , 1/3, 1/3, 1/3 > >> > >> Since A is sleeping for 50%, I don't think we can assume equal distribution for > >> the 3 events (can't just divide by 3). > >> > >> I believe we can assume that > >> > >> a. occurs 25% of the time > >> b. occurs 25% of the time > >> c. occurs 50% of the time > >> > >> I *think* this should provide something more representative. > >> > >>> > >>> But now say A happen to block on a lock that C is holding. You would boost C to > >>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what > >>> C should actually get. > >>> > >>> C should get (8/30 + 6/30 = 14/30) AFAICS. > >>> > >>> Hopefully one can see that a straight summation of weights is not enough. It > >>> needs to be something like: > >>> > >>> C's new weight = C's original weight + (A's weight) * (A's utilization) > >>> > >>> Or something, otherwise the inherited weight may be too much to properly solve it. > >>> > >>> Any thoughts on this? You mentioned you had some notes on this and/or proxy > >>> execution, could you share it? > >> > >> I assume we'll be using rt-mutex inheritance property to handle this? If this > >> was discussed during a talk, I'd appreciate a link to that. > >> > >> In the past in OSPM conference we brought up an issue with performance > >> inversion where a task running on a smaller (slower to be more generic) CPU is > >> holding the lock and causing massive delays for waiters. This is an artefact of > >> DVFS. For HMP, there's an additional cause due to the unequal capacities of the > >> CPUs. > >> > >> Proxy execution seems to be the nice solution to all of these problems, but > >> it's a long way away. I'm interested to learn how this inheritance will be > >> implemented. And whether there are any userspace conversion issues. i.e: do > >> we need to convert all locks to rt-mutex locks? > >> > >> > >> Thanks > >> > >> -- > >> Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 18:10 ` Youssef Esmat @ 2022-09-30 18:45 ` Joel Fernandes 0 siblings, 0 replies; 22+ messages in thread From: Joel Fernandes @ 2022-09-30 18:45 UTC (permalink / raw) To: Youssef Esmat Cc: Qais Yousef, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On Fri, Sep 30, 2022 at 2:10 PM Youssef Esmat <youssefesmat@google.com> wrote: [..] > > > Hi Everyone! > > > > Hi Youssef, > > > > (Youssef is new to LKML though in no way new to OS or software development. I > > gave him the usual 'dont-top-post' chat already - fyi). > > > > > I am not sure we should care about A's sleeping pattern. The case we > > > care about is when A is running or wants to run but can't because it > > > is blocked on C. In that case C should get the weight of A as if A was > > > running. > > > > Just to clarify - Youssef did mean sum of weights of different things in the > > chain, and not just weights (he confirmed on chat that that's what he meant). > > > > Yeah thanks for clarifying, I meant that C should get the sum of > weights as if A was running (3/5 in your example) since in this > segment of time A would have been running if it was not blocked on the > lock. I think it's safe to ignore the average and just use the sum of For the onlooker, we are talking about the classical case of priority inversion involving 3 tasks A, B and C which can be expanded to a chain of tasks. Highest prio A blocks on a lock that lowest prio C holds, while an unrelated medium prio B blocks C (or reduces progress of it as in the case of CFS). On the note of "A would have been running if it was not blocked on the lock". I think that would be an assumption - we don't know if A would be running. We only know the past, not the future. A could very well make an I/O request for example. Hence there could be a need to use A's past utilization, right? thanks, - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 13:49 ` Qais Yousef 2022-09-30 15:44 ` Youssef Esmat @ 2022-09-30 17:34 ` Joel Fernandes 2022-10-03 16:14 ` Qais Yousef 2022-10-04 11:50 ` Sebastian Andrzej Siewior 1 sibling, 2 replies; 22+ messages in thread From: Joel Fernandes @ 2022-09-30 17:34 UTC (permalink / raw) To: Qais Yousef Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 9/30/2022 9:49 AM, Qais Yousef wrote: > Hi Joel > > I'm interested in the topic, if I can be CCed in any future discussions I'd > appreciate it :) Yes, surely! Will do :) > On 09/29/22 16:38, Joel Fernandes wrote: >> Hi Peter, all, >> >> Just following-up about the idea Peter suggested at LPC22 about sum of weights >> to solve the CFS priority inversion issues using priority inheritance. I am not >> sure if a straight forward summation of the weights of dependencies in the >> chain, is sufficient (or may cause too much unfairness). >> >> I think it will work if all the tasks on CPU are 100% in utilization: >> >> Say if you have 4 tasks (A, B, C, D) running and each one has equal >> weight (W) except for A which has twice the weight (2W). >> So the CPU bandwidth distribution is (assuming all are running): >> A: 2 / 5 >> B, C. D: 1 / 5 >> >> Say out of the 4 tasks, 3 of them are a part of a classical priority >> inversion scenario (A, B and C). >> >> Say now A blocks on a lock and that lock's owner C is running, however now >> because A has blocked, B gets 1/3 bandwidth, where as it should have been >> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 >> bandwidth - still not fair since B is eating away CPU bandwidth causing the >> priority inversion we want to remedy. >> >> The correct bandwidth distribution should be (B and D should be unchanged): >> B = 1/5 >> D = 1/5 >> >> C = 3/5 >> >> This means that C's weight should be 3W , and B and D should be W each >> as before. So indeed, C's new weight is its original weight PLUS the >> weight of the A - that's needed to keep the CPU usage of the other >> tasks (B, D) in check so that C makes forward progress on behalf of A and the >> other tasks don't eat into the CPU utilization. >> >> However, I think this will kinda fall apart if A is asleep 50% of the time >> (assume the sleep is because of I/O and unrelated to the PI chain). >> >> Because now if all were running (and assume no PI dependencies), with A being >> 50%, the bandwidth of B, C and D each would be divided into 2 components: >> >> a. when A is running, it would be as above. >> b. but if A was sleeping, B, C, and D would get 1/3. >> >> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 >> or 1/5 bandwidth. > > The average metric is interesting one. It can be confusing to reason about too. > > I think we have 3 events to take into account here, not 2: > > a. when A is running and NOT blocked on C. > b. when A is running and BLOCKED on C. > c. A is sleeping. > > This means A, B, C and D's shares will be: > > A , B , C , D > a. 2/5, 1/5, 1/5, 1/5 > b. - , 3/5, 1/5, 1/5 Here did you mean: b. -, 1/5, 3/5, 1/5 ? A blocked on C means, C should get A's weight (on top of its own). > c. - , 1/3, 1/3, 1/3 > > Since A is sleeping for 50%, I don't think we can assume equal distribution for > the 3 events (can't just divide by 3). Oh yeah, I did not get to _that_ part of the story yet at this point of the email, I brought it up later below where I say "But now say A happen to block".. > > I believe we can assume that > > a. occurs 25% of the time > b. occurs 25% of the time > c. occurs 50% of the time > > I *think* this should provide something more representative. Yes possible. My basic idea was I was trying to *not* change the distribution of B based on whether A blocks on C, or whether it does not. In my view, B's distribution should not change just because A and C have a lock dependency, because otherwise C can get too much or too little time. If C gets too much time, then its hurting B. If C gets too little time, then its hurting A. >> But now say A happen to block on a lock that C is holding. You would boost C to >> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what >> C should actually get. >> >> C should get (8/30 + 6/30 = 14/30) AFAICS. >> >> Hopefully one can see that a straight summation of weights is not enough. It >> needs to be something like: >> >> C's new weight = C's original weight + (A's weight) * (A's utilization) >> >> Or something, otherwise the inherited weight may be too much to properly solve it. >> >> Any thoughts on this? You mentioned you had some notes on this and/or proxy >> execution, could you share it? > > I assume we'll be using rt-mutex inheritance property to handle this? If this > was discussed during a talk, I'd appreciate a link to that. Yes that's the idea but I am also aware that 'other' kinds of dependencies in userspace that cannot benefit from a kernel-only boost. So if we consider a bounded-buffer producer/consumer. We can consider the producer as A, and the consumer as C, with B being a random mid-priority task. Once the bounded buffer gets full, A is waiting on a signal from C that it filled the buffer for which C needs to run in the first place to queue its payload into the buffer. However, trouble-maker B is determined to eat away's C's time and develop a prio inversion between itself and A. This pattern should also generalize to a worker pool consuming work. In this case, there is no lock involved yet you have a dependency. But I don't mean to sound depressing, and just because there are cases like this does not mean we should not solve the lock-based ones. When I looked at Android, I saw that it uses futex directly from Android Runtime code instead of using pthread. So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do in the kernel will JustWork(Tm) ? > > In the past in OSPM conference we brought up an issue with performance > inversion where a task running on a smaller (slower to be more generic) CPU is > holding the lock and causing massive delays for waiters. This is an artefact of > DVFS. For HMP, there's an additional cause due to the unequal capacities of the > CPUs. > > Proxy execution seems to be the nice solution to all of these problems, but > it's a long way away. I'm interested to learn how this inheritance will be > implemented. And whether there are any userspace conversion issues. i.e: do > we need to convert all locks to rt-mutex locks? I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds reasonable to me. Other thoughts? thanks, - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 17:34 ` Joel Fernandes @ 2022-10-03 16:14 ` Qais Yousef 2022-10-03 16:27 ` Joel Fernandes 2022-10-04 20:27 ` Joel Fernandes 2022-10-04 11:50 ` Sebastian Andrzej Siewior 1 sibling, 2 replies; 22+ messages in thread From: Qais Yousef @ 2022-10-03 16:14 UTC (permalink / raw) To: Joel Fernandes Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 09/30/22 13:34, Joel Fernandes wrote: > > > On 9/30/2022 9:49 AM, Qais Yousef wrote: > > Hi Joel > > > > I'm interested in the topic, if I can be CCed in any future discussions I'd > > appreciate it :) > > Yes, surely! Will do :) > > > On 09/29/22 16:38, Joel Fernandes wrote: > >> Hi Peter, all, > >> > >> Just following-up about the idea Peter suggested at LPC22 about sum of weights > >> to solve the CFS priority inversion issues using priority inheritance. I am not > >> sure if a straight forward summation of the weights of dependencies in the > >> chain, is sufficient (or may cause too much unfairness). > >> > >> I think it will work if all the tasks on CPU are 100% in utilization: > >> > >> Say if you have 4 tasks (A, B, C, D) running and each one has equal > >> weight (W) except for A which has twice the weight (2W). > >> So the CPU bandwidth distribution is (assuming all are running): > >> A: 2 / 5 > >> B, C. D: 1 / 5 > >> > >> Say out of the 4 tasks, 3 of them are a part of a classical priority > >> inversion scenario (A, B and C). > >> > >> Say now A blocks on a lock and that lock's owner C is running, however now > >> because A has blocked, B gets 1/3 bandwidth, where as it should have been > >> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 > >> bandwidth - still not fair since B is eating away CPU bandwidth causing the > >> priority inversion we want to remedy. > >> > >> The correct bandwidth distribution should be (B and D should be unchanged): > >> B = 1/5 > >> D = 1/5 > >> > >> C = 3/5 > >> > >> This means that C's weight should be 3W , and B and D should be W each > >> as before. So indeed, C's new weight is its original weight PLUS the > >> weight of the A - that's needed to keep the CPU usage of the other > >> tasks (B, D) in check so that C makes forward progress on behalf of A and the > >> other tasks don't eat into the CPU utilization. > >> > >> However, I think this will kinda fall apart if A is asleep 50% of the time > >> (assume the sleep is because of I/O and unrelated to the PI chain). > >> > >> Because now if all were running (and assume no PI dependencies), with A being > >> 50%, the bandwidth of B, C and D each would be divided into 2 components: > >> > >> a. when A is running, it would be as above. > >> b. but if A was sleeping, B, C, and D would get 1/3. > >> > >> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 > >> or 1/5 bandwidth. > > > > The average metric is interesting one. It can be confusing to reason about too. > > > > I think we have 3 events to take into account here, not 2: > > > > a. when A is running and NOT blocked on C. > > b. when A is running and BLOCKED on C. > > c. A is sleeping. > > > > This means A, B, C and D's shares will be: > > > > A , B , C , D > > a. 2/5, 1/5, 1/5, 1/5 > > b. - , 3/5, 1/5, 1/5 > > Here did you mean: > b. -, 1/5, 3/5, 1/5 ? Yes! Sorry a typo. > > A blocked on C means, C should get A's weight (on top of its own). > > > c. - , 1/3, 1/3, 1/3 > > > > Since A is sleeping for 50%, I don't think we can assume equal distribution for > > the 3 events (can't just divide by 3). > > Oh yeah, I did not get to _that_ part of the story yet at this point of the > email, I brought it up later below where I say "But now say A happen to block".. Hmm, but that's my point, I think you need to take _that_ part here or we're not be doing an apple to apple comparison. > > > > > I believe we can assume that > > > > a. occurs 25% of the time > > b. occurs 25% of the time > > c. occurs 50% of the time > > > > I *think* this should provide something more representative. > > Yes possible. My basic idea was I was trying to *not* change the distribution of > B based on whether A blocks on C, or whether it does not. In my view, B's > distribution should not change just because A and C have a lock dependency, > because otherwise C can get too much or too little time. If C gets too much > time, then its hurting B. If C gets too little time, then its hurting A. Maybe I am getting confused. But AFAICT you're treating a. when A is running, it would be as above. b. but if A was sleeping, B, C, and D would get 1/3. similar to a. when A is running *and blocked on C for all its runtime* b. but if A was sleeping, B, C, and D would get 1/3. I don't think this is valid. If A is blocked on C for 50% of the time, and sleeping for 50% of the time, when did it get blocked/unblocked? This will have an impact on the average share for C and skew it, no? Unless I missed something, the average share of C being (3/5 + 1/3) is an impossible state. You need to consider the portion of time when C runs as 1/5, when A is actually not blocked on anything, too. Hmm actually I just re-read your statement below and you just say 3/5 (18/30) is too much. You didn't consider the average. I'll leave the above in hope to help me understand what am I missing and where I went wrong :-) Generally IMHO looking at the average will not help. I think if the share values make sense in each state individually (and I believe they are), that would be enough. AFAICS, B and D are still taking the right amount of time when C inherits the bandwidth. And C by definition will run longer when A is blocked on it for the whole duration of this blocked time. > > >> But now say A happen to block on a lock that C is holding. You would boost C to > >> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what > >> C should actually get. > >> > >> C should get (8/30 + 6/30 = 14/30) AFAICS. > >> > >> Hopefully one can see that a straight summation of weights is not enough. It > >> needs to be something like: > >> > >> C's new weight = C's original weight + (A's weight) * (A's utilization) > >> > >> Or something, otherwise the inherited weight may be too much to properly solve it. > >> > >> Any thoughts on this? You mentioned you had some notes on this and/or proxy > >> execution, could you share it? > > > > I assume we'll be using rt-mutex inheritance property to handle this? If this > > was discussed during a talk, I'd appreciate a link to that. > > Yes that's the idea but I am also aware that 'other' kinds of dependencies in > userspace that cannot benefit from a kernel-only boost. > > So if we consider a bounded-buffer producer/consumer. We can consider the > producer as A, and the consumer as C, with B being a random mid-priority task. > Once the bounded buffer gets full, A is waiting on a signal from C that it > filled the buffer for which C needs to run in the first place to queue its > payload into the buffer. However, trouble-maker B is determined to eat away's > C's time and develop a prio inversion between itself and A. This pattern should > also generalize to a worker pool consuming work. Will proxy-execution be able to handle these other cases? I think it is tied to rt-mutex to be able to detect the dependency chain too. Looks a generic extension of the problem that all potential solutions will need to deal with. > > In this case, there is no lock involved yet you have a dependency. But I don't > mean to sound depressing, and just because there are cases like this does not > mean we should not solve the lock-based ones. When I looked at Android, I saw > that it uses futex directly from Android Runtime code instead of using pthread. > So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do > in the kernel will JustWork(Tm) ? I guess it will depend on individual libc implementation, but I thought all of them use FUTEX under the hood for pthreads mutexes. Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI? > > > > > In the past in OSPM conference we brought up an issue with performance > > inversion where a task running on a smaller (slower to be more generic) CPU is > > holding the lock and causing massive delays for waiters. This is an artefact of > > DVFS. For HMP, there's an additional cause due to the unequal capacities of the > > CPUs. > > > > Proxy execution seems to be the nice solution to all of these problems, but > > it's a long way away. I'm interested to learn how this inheritance will be > > implemented. And whether there are any userspace conversion issues. i.e: do > > we need to convert all locks to rt-mutex locks? > > I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to > weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds > reasonable to me. I realized whether we need to worry about in-kernel kthreads users too.. pthreads requires users to specify if they want PTHREAD_PRIO_{INHERIT, PROTECT} when initializing their mutex. AFAICS from glibc, PTHREAD_PRIO_INHERIT maps to FUTEX_LOCK_PI. But PTHREAD_PRIO_PROTOTECT uses a different algorithm that is implemented in glibc. This makes me think how much this makes sense in linux as for CFS priorities are transalted into weights and not actual priorites like in RT :-/ I need to dig more.. Expert opinions would help for sure :) Thanks! -- Qais Yousef > > Other thoughts? > > thanks, > > - Joel > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-03 16:14 ` Qais Yousef @ 2022-10-03 16:27 ` Joel Fernandes 2022-10-04 16:30 ` Qais Yousef 2022-10-04 20:27 ` Joel Fernandes 1 sibling, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-10-03 16:27 UTC (permalink / raw) To: Qais Yousef Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney There's a lot to unwind so I will reply in pieces after spending some time thinking about it, but just for this part: On 10/3/2022 12:14 PM, Qais Yousef wrote: >> In this case, there is no lock involved yet you have a dependency. But I don't >> mean to sound depressing, and just because there are cases like this does not >> mean we should not solve the lock-based ones. When I looked at Android, I saw >> that it uses futex directly from Android Runtime code instead of using pthread. >> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do >> in the kernel will JustWork(Tm) ? > I guess it will depend on individual libc implementation, but I thought all of > them use FUTEX under the hood for pthreads mutexes. > > Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI? > In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in the futex word to signify that lock is held. That wont work for the case above, Producer/Consumer signalling each other on a bounded-buffer, right? That's not locking even though it is acquiring and release of a limited resource. thanks, - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-03 16:27 ` Joel Fernandes @ 2022-10-04 16:30 ` Qais Yousef 2022-10-04 19:48 ` Joel Fernandes 0 siblings, 1 reply; 22+ messages in thread From: Qais Yousef @ 2022-10-04 16:30 UTC (permalink / raw) To: Joel Fernandes Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 10/03/22 12:27, Joel Fernandes wrote: > There's a lot to unwind so I will reply in pieces after spending some time > thinking about it, but just for this part: > > On 10/3/2022 12:14 PM, Qais Yousef wrote: > >> In this case, there is no lock involved yet you have a dependency. But I don't > >> mean to sound depressing, and just because there are cases like this does not > >> mean we should not solve the lock-based ones. When I looked at Android, I saw > >> that it uses futex directly from Android Runtime code instead of using pthread. > >> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do > >> in the kernel will JustWork(Tm) ? > > I guess it will depend on individual libc implementation, but I thought all of > > them use FUTEX under the hood for pthreads mutexes. > > > > Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI? > > > > In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in > the futex word to signify that lock is held. Right. So userspace has to opt-in. > That wont work for the case above, Producer/Consumer signalling each other on a > bounded-buffer, right? That's not locking even though it is acquiring and > release of a limited resource. Yes but as I tried to point out I don't think proxy-execution handles this case where you don't hold a lock explicitly. But I could be wrong. IIUC Sebastian's understanding is similar to mine. Only 'locks' (FUTEX_LOCK_PI which ends up using rt-mutex) do PI inheritance. So this signaling scenario is a new class of problems that wasn't handled before; to my understanding. Thanks -- Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-04 16:30 ` Qais Yousef @ 2022-10-04 19:48 ` Joel Fernandes 2022-10-05 9:31 ` Qais Yousef 0 siblings, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-10-04 19:48 UTC (permalink / raw) To: Qais Yousef Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 10/4/2022 12:30 PM, Qais Yousef wrote: > On 10/03/22 12:27, Joel Fernandes wrote: >> There's a lot to unwind so I will reply in pieces after spending some time >> thinking about it, but just for this part: >> >> On 10/3/2022 12:14 PM, Qais Yousef wrote: >>>> In this case, there is no lock involved yet you have a dependency. But I don't >>>> mean to sound depressing, and just because there are cases like this does not >>>> mean we should not solve the lock-based ones. When I looked at Android, I saw >>>> that it uses futex directly from Android Runtime code instead of using pthread. >>>> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do >>>> in the kernel will JustWork(Tm) ? >>> I guess it will depend on individual libc implementation, but I thought all of >>> them use FUTEX under the hood for pthreads mutexes. >>> >>> Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI? >>> >> >> In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in >> the futex word to signify that lock is held. > > Right. So userspace has to opt-in. > >> That wont work for the case above, Producer/Consumer signalling each other on a >> bounded-buffer, right? That's not locking even though it is acquiring and >> release of a limited resource. > > Yes but as I tried to point out I don't think proxy-execution handles this case > where you don't hold a lock explicitly. But I could be wrong. I don't disagree. Proxy execution is an implementation detail, without more information from userspace, any implementation cannot help. I was just responding to your point about converting all futexes which you cannot do without knowing what the futex is used for. But I am thinking of messing around with rt_mutex_setprio() and some userspace tests to see if I can make the sum of weights thing work for the *userspace locking* usecases (FUTEX_LOCK_PI). Then run some tests and collect some traces. Perhaps you can do that on the Android side as well. > IIUC Sebastian's > understanding is similar to mine. Only 'locks' (FUTEX_LOCK_PI which ends up > using rt-mutex) do PI inheritance. > > So this signaling scenario is a new class of problems that wasn't handled > before; to my understanding. Most certainly, agreed. Thanks! - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-04 19:48 ` Joel Fernandes @ 2022-10-05 9:31 ` Qais Yousef 0 siblings, 0 replies; 22+ messages in thread From: Qais Yousef @ 2022-10-05 9:31 UTC (permalink / raw) To: Joel Fernandes Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 10/04/22 15:48, Joel Fernandes wrote: > > > On 10/4/2022 12:30 PM, Qais Yousef wrote: > > On 10/03/22 12:27, Joel Fernandes wrote: > >> There's a lot to unwind so I will reply in pieces after spending some time > >> thinking about it, but just for this part: > >> > >> On 10/3/2022 12:14 PM, Qais Yousef wrote: > >>>> In this case, there is no lock involved yet you have a dependency. But I don't > >>>> mean to sound depressing, and just because there are cases like this does not > >>>> mean we should not solve the lock-based ones. When I looked at Android, I saw > >>>> that it uses futex directly from Android Runtime code instead of using pthread. > >>>> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do > >>>> in the kernel will JustWork(Tm) ? > >>> I guess it will depend on individual libc implementation, but I thought all of > >>> them use FUTEX under the hood for pthreads mutexes. > >>> > >>> Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI? > >>> > >> > >> In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in > >> the futex word to signify that lock is held. > > > > Right. So userspace has to opt-in. > > > >> That wont work for the case above, Producer/Consumer signalling each other on a > >> bounded-buffer, right? That's not locking even though it is acquiring and > >> release of a limited resource. > > > > Yes but as I tried to point out I don't think proxy-execution handles this case > > where you don't hold a lock explicitly. But I could be wrong. > > I don't disagree. Proxy execution is an implementation detail, without more > information from userspace, any implementation cannot help. I was just > responding to your point about converting all futexes which you cannot do > without knowing what the futex is used for. +1 I don't think I read much on literature on priority inversion caused by waiting on signals. I need to research that. I think it is considered a voluntary sleep and sane system design should ensure both of these tasks priorities don't lead to starvation based on expected rate of producer/consumer. It doesn't seem to be a problem for PREEMPT_RT since no body has done anything about it AFAICT? It could be the fact that in CFS priority is weights (or bandwidth) and this introduces this new class of problems. I think we should still ask the question if the priority assignment is wrong when this happens. If there's a clear relationship between producer/consumer, should they have the same priority if they do equal amount of work? > > But I am thinking of messing around with rt_mutex_setprio() and some userspace > tests to see if I can make the sum of weights thing work for the *userspace > locking* usecases (FUTEX_LOCK_PI). Then run some tests and collect some traces. > Perhaps you can do that on the Android side as well. I'd be happy to help, yes :) In my view, the trickiest part would be is how to account for the stolen time. If C gets 3/5 share and runs for 2/5 before releasing the lock, then when A wakes up, it should perceive that it ran for 1/5 (time C stole from A while holding the lock) and run only for 1/5 before getting preempted. To preserve its 2/5 share. That is IF we want to be very accurate. If the 2 tasks are not on the same rq, I think that will not change how things are done a lot.. > > > IIUC Sebastian's > > understanding is similar to mine. Only 'locks' (FUTEX_LOCK_PI which ends up > > using rt-mutex) do PI inheritance. > > > > So this signaling scenario is a new class of problems that wasn't handled > > before; to my understanding. > Most certainly, agreed. Sorry I am thinking out loud for most/all part of my reply :) Thanks! -- Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-03 16:14 ` Qais Yousef 2022-10-03 16:27 ` Joel Fernandes @ 2022-10-04 20:27 ` Joel Fernandes 2022-10-05 10:04 ` Qais Yousef 1 sibling, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-10-04 20:27 UTC (permalink / raw) To: Qais Yousef Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 10/3/2022 12:14 PM, Qais Yousef wrote: > On 09/30/22 13:34, Joel Fernandes wrote: >> >> >> On 9/30/2022 9:49 AM, Qais Yousef wrote: >>> Hi Joel >>> >>> I'm interested in the topic, if I can be CCed in any future discussions I'd >>> appreciate it :) >> >> Yes, surely! Will do :) >> >>> On 09/29/22 16:38, Joel Fernandes wrote: >>>> Hi Peter, all, >>>> >>>> Just following-up about the idea Peter suggested at LPC22 about sum of weights >>>> to solve the CFS priority inversion issues using priority inheritance. I am not >>>> sure if a straight forward summation of the weights of dependencies in the >>>> chain, is sufficient (or may cause too much unfairness). >>>> >>>> I think it will work if all the tasks on CPU are 100% in utilization: >>>> >>>> Say if you have 4 tasks (A, B, C, D) running and each one has equal >>>> weight (W) except for A which has twice the weight (2W). >>>> So the CPU bandwidth distribution is (assuming all are running): >>>> A: 2 / 5 >>>> B, C. D: 1 / 5 >>>> >>>> Say out of the 4 tasks, 3 of them are a part of a classical priority >>>> inversion scenario (A, B and C). >>>> >>>> Say now A blocks on a lock and that lock's owner C is running, however now >>>> because A has blocked, B gets 1/3 bandwidth, where as it should have been >>>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4 >>>> bandwidth - still not fair since B is eating away CPU bandwidth causing the >>>> priority inversion we want to remedy. >>>> >>>> The correct bandwidth distribution should be (B and D should be unchanged): >>>> B = 1/5 >>>> D = 1/5 >>>> >>>> C = 3/5 >>>> >>>> This means that C's weight should be 3W , and B and D should be W each >>>> as before. So indeed, C's new weight is its original weight PLUS the >>>> weight of the A - that's needed to keep the CPU usage of the other >>>> tasks (B, D) in check so that C makes forward progress on behalf of A and the >>>> other tasks don't eat into the CPU utilization. >>>> >>>> However, I think this will kinda fall apart if A is asleep 50% of the time >>>> (assume the sleep is because of I/O and unrelated to the PI chain). >>>> >>>> Because now if all were running (and assume no PI dependencies), with A being >>>> 50%, the bandwidth of B, C and D each would be divided into 2 components: >>>> >>>> a. when A is running, it would be as above. >>>> b. but if A was sleeping, B, C, and D would get 1/3. >>>> >>>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30 >>>> or 1/5 bandwidth. >>> >>> The average metric is interesting one. It can be confusing to reason about too. >>> >>> I think we have 3 events to take into account here, not 2: >>> >>> a. when A is running and NOT blocked on C. >>> b. when A is running and BLOCKED on C. >>> c. A is sleeping. >>> >>> This means A, B, C and D's shares will be: >>> >>> A , B , C , D >>> a. 2/5, 1/5, 1/5, 1/5 >>> b. - , 3/5, 1/5, 1/5 >> >> Here did you mean: >> b. -, 1/5, 3/5, 1/5 ? > > Yes! Sorry a typo. > >> >> A blocked on C means, C should get A's weight (on top of its own). >> >>> c. - , 1/3, 1/3, 1/3 >>> >>> Since A is sleeping for 50%, I don't think we can assume equal distribution for >>> the 3 events (can't just divide by 3). >> >> Oh yeah, I did not get to _that_ part of the story yet at this point of the >> email, I brought it up later below where I say "But now say A happen to block".. > > Hmm, but that's my point, I think you need to take _that_ part here or we're > not be doing an apple to apple comparison. > >> >>> >>> I believe we can assume that >>> >>> a. occurs 25% of the time >>> b. occurs 25% of the time >>> c. occurs 50% of the time >>> >>> I *think* this should provide something more representative. >> >> Yes possible. My basic idea was I was trying to *not* change the distribution of >> B based on whether A blocks on C, or whether it does not. In my view, B's >> distribution should not change just because A and C have a lock dependency, >> because otherwise C can get too much or too little time. If C gets too much >> time, then its hurting B. If C gets too little time, then its hurting A. > > Maybe I am getting confused. But AFAICT you're treating > > > a. when A is running, it would be as above. > b. but if A was sleeping, B, C, and D would get 1/3. > > similar to > > a. when A is running *and blocked on C for all its runtime* > b. but if A was sleeping, B, C, and D would get 1/3. I am treating the following the same: a. when A is running, it would be as above. b. but if A was sleeping, B, C, and D would get 1/3. similar to a. when A is running *and blocked on C for all its runtime* ^^ -- in this case, B and D should not have their distributions changed at all because they are not participating in the lock acquire and release. So they should neither be hurt any more, nor be boosted. They should simply stay same [1] b. but if A was sleeping, B, C, and D would get 1/3. [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio. If all are running 100% and A does not block on C, B is blocked by A indefinitely. So the prio of A and B are inverted. We seek to rectify this, that is we need make changes such that, B is returned back to the blocked state. We do this by boosting C. In other words, the prio inheritance will cause B's distribution to not be changed (it was supposed to be blocked before and it is now going to be blocked state again). CFS should not behave any differently, B's distribution should not be changed before/after the priority inhertiance of A by C. That's just my opinion - and that's how I calculated to distribution. With that mind, could you go back to seeing if my math was originally correct or did I mess something up? I do think though that Youssef's point of not worrying about being too accurate is reasonable if the critical sections are short lived but I'm not sure. > > I don't think this is valid. If A is blocked on C for 50% of the time, and > sleeping for 50% of the time, when did it get blocked/unblocked? > > This will have an impact on the average share for C and skew it, no? > > Unless I missed something, the average share of C being (3/5 + 1/3) is an > impossible state. You need to consider the portion of time when C runs as 1/5, > when A is actually not blocked on anything, too. > > Hmm actually I just re-read your statement below and you just say 3/5 (18/30) > is too much. You didn't consider the average. I'll leave the above in hope to > help me understand what am I missing and where I went wrong :-) > > Generally IMHO looking at the average will not help. I think if the share > values make sense in each state individually (and I believe they are), that > would be enough. AFAICS, B and D are still taking the right amount of time when > C inherits the bandwidth. And C by definition will run longer when A is blocked > on it for the whole duration of this blocked time. I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify the math, and then taking average of that. I think that's reasonable? > >> >>>> But now say A happen to block on a lock that C is holding. You would boost C to >>>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what >>>> C should actually get. >>>> >>>> C should get (8/30 + 6/30 = 14/30) AFAICS. >>>> >>>> Hopefully one can see that a straight summation of weights is not enough. It >>>> needs to be something like: >>>> >>>> C's new weight = C's original weight + (A's weight) * (A's utilization) >>>> >>>> Or something, otherwise the inherited weight may be too much to properly solve it. >>>> >>>> Any thoughts on this? You mentioned you had some notes on this and/or proxy >>>> execution, could you share it? >>> >>> I assume we'll be using rt-mutex inheritance property to handle this? If this >>> was discussed during a talk, I'd appreciate a link to that. >> >> Yes that's the idea but I am also aware that 'other' kinds of dependencies in >> userspace that cannot benefit from a kernel-only boost. >> >> So if we consider a bounded-buffer producer/consumer. We can consider the >> producer as A, and the consumer as C, with B being a random mid-priority task. >> Once the bounded buffer gets full, A is waiting on a signal from C that it >> filled the buffer for which C needs to run in the first place to queue its >> payload into the buffer. However, trouble-maker B is determined to eat away's >> C's time and develop a prio inversion between itself and A. This pattern should >> also generalize to a worker pool consuming work. > > Will proxy-execution be able to handle these other cases? I think it is tied to > rt-mutex to be able to detect the dependency chain too. Looks a generic > extension of the problem that all potential solutions will need to deal with. Yeah, in theory if the producer and consumer can "mark" themselves as potentially have someone waiting for them, then the kernel has that info and any implementation may capitalize on that. I am not sure if this is already possible by some command in the futex API. >>> In the past in OSPM conference we brought up an issue with performance >>> inversion where a task running on a smaller (slower to be more generic) CPU is >>> holding the lock and causing massive delays for waiters. This is an artefact of >>> DVFS. For HMP, there's an additional cause due to the unequal capacities of the >>> CPUs. >>> >>> Proxy execution seems to be the nice solution to all of these problems, but >>> it's a long way away. I'm interested to learn how this inheritance will be >>> implemented. And whether there are any userspace conversion issues. i.e: do >>> we need to convert all locks to rt-mutex locks? >> >> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to >> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds >> reasonable to me. > > I realized whether we need to worry about in-kernel kthreads users too.. Possibly. > pthreads requires users to specify if they want PTHREAD_PRIO_{INHERIT, PROTECT} > when initializing their mutex. PTHREAD_PRIO_PROTECT seems like an interesting approach! Thanks for mentioning. > AFAICS from glibc, PTHREAD_PRIO_INHERIT maps to FUTEX_LOCK_PI. Ok. > But PTHREAD_PRIO_PROTOTECT uses a different algorithm that is implemented in > glibc. This makes me think how much this makes sense in linux as for CFS > priorities are transalted into weights and not actual priorites like in RT :-/ > > I need to dig more.. Ah same here, more reading/digging to do ;). - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-04 20:27 ` Joel Fernandes @ 2022-10-05 10:04 ` Qais Yousef 2022-10-06 13:53 ` Joel Fernandes 0 siblings, 1 reply; 22+ messages in thread From: Qais Yousef @ 2022-10-05 10:04 UTC (permalink / raw) To: Joel Fernandes Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney Hi Joel On 10/04/22 16:27, Joel Fernandes wrote: [...] > I am treating the following the same: > > a. when A is running, it would be as above. > b. but if A was sleeping, B, C, and D would get 1/3. > > similar to > > a. when A is running *and blocked on C for all its runtime* > ^^ -- in this case, B and D should not have their distributions > changed at all because they are not participating in the > lock acquire and release. So they should neither be hurt > any more, nor be boosted. They should simply stay same [1] > > b. but if A was sleeping, B, C, and D would get 1/3. > > > [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio. > > If all are running 100% and A does not block on C, B is blocked by A > indefinitely. So the prio of A and B are inverted. We seek to rectify this, that > is we need make changes such that, B is returned back to the blocked state. We > do this by boosting C. > > In other words, the prio inheritance will cause B's distribution to not be > changed (it was supposed to be blocked before and it is now going to be blocked > state again). > > CFS should not behave any differently, B's distribution should not be changed > before/after the priority inhertiance of A by C. That's just my opinion - and > that's how I calculated to distribution. With that mind, could you go back to > seeing if my math was originally correct or did I mess something up? It's not about the math. But I think the before and after can't be the same for C.. > I do think though that Youssef's point of not worrying about being too accurate > is reasonable if the critical sections are short lived but I'm not sure. .. I do agree with that as well. I was just trying to highlight that looking at average can be misleading and I don't see C taking too much time. If any worries I have, it'd be not accounting correctly for the stolen time C takes from A. Otherwise A + C share combined would be higher than it should be. Which might be the problem you're trying to highlight but I am unable to get/see. But this is an implementation detail and an artefact of wrong accounting, not how shares are summed. > > I don't think this is valid. If A is blocked on C for 50% of the time, and > > sleeping for 50% of the time, when did it get blocked/unblocked? > > > > This will have an impact on the average share for C and skew it, no? > > > > Unless I missed something, the average share of C being (3/5 + 1/3) is an > > impossible state. You need to consider the portion of time when C runs as 1/5, > > when A is actually not blocked on anything, too. > > > > Hmm actually I just re-read your statement below and you just say 3/5 (18/30) > > is too much. You didn't consider the average. I'll leave the above in hope to > > help me understand what am I missing and where I went wrong :-) > > > > Generally IMHO looking at the average will not help. I think if the share > > values make sense in each state individually (and I believe they are), that > > would be enough. AFAICS, B and D are still taking the right amount of time when > > C inherits the bandwidth. And C by definition will run longer when A is blocked > > on it for the whole duration of this blocked time. > > I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify > the math, and then taking average of that. I think that's reasonable? I'm not sure. This is skewing the results in my view. I think the comparison should just be: 1) A, B, C, and D are all running and nothing gets blocked at all. Then shares would be: 2/5, 1/5, 1/5, 1/5 2) A is blocked and C; B, C, D are running with no blocked time. Shares would be: - , 1/5, 3/5, 1/5 By definition, we want to treat A in (2) as RUNNING because as soon as C unblocks A we should return to (1). From B and D perspective, their share is not impacted throughout this transition. Which is AFAIU is what we want to achieve. I think considering the sleeping time and averaging can lead to misleading results if care is not taken. Anyway - just trying to explain how I see it and why C is unlikely to be taking too much time. I could be wrong. As Youssef said, I think there's no fundamental problem here. My 2 cents ;-) Thanks! -- Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-05 10:04 ` Qais Yousef @ 2022-10-06 13:53 ` Joel Fernandes 2022-10-06 19:40 ` Youssef Esmat 0 siblings, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-10-06 13:53 UTC (permalink / raw) To: Qais Yousef Cc: Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On Wed, Oct 5, 2022 at 6:04 AM Qais Yousef <qais.yousef@arm.com> wrote: > > Hi Joel > > On 10/04/22 16:27, Joel Fernandes wrote: > > [...] > > > I am treating the following the same: > > > > a. when A is running, it would be as above. > > b. but if A was sleeping, B, C, and D would get 1/3. > > > > similar to > > > > a. when A is running *and blocked on C for all its runtime* > > ^^ -- in this case, B and D should not have their distributions > > changed at all because they are not participating in the > > lock acquire and release. So they should neither be hurt > > any more, nor be boosted. They should simply stay same [1] > > > > b. but if A was sleeping, B, C, and D would get 1/3. > > > > > > [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio. > > > > If all are running 100% and A does not block on C, B is blocked by A > > indefinitely. So the prio of A and B are inverted. We seek to rectify this, that > > is we need make changes such that, B is returned back to the blocked state. We > > do this by boosting C. > > > > In other words, the prio inheritance will cause B's distribution to not be > > changed (it was supposed to be blocked before and it is now going to be blocked > > state again). > > > > CFS should not behave any differently, B's distribution should not be changed > > before/after the priority inhertiance of A by C. That's just my opinion - and > > that's how I calculated to distribution. With that mind, could you go back to > > seeing if my math was originally correct or did I mess something up? > > It's not about the math. But I think the before and after can't be the same for > C.. C is acquiring/releasing the lock so I expect its distribution to change. I was talking about the poor B who has nothing to do with the lock. > > > I don't think this is valid. If A is blocked on C for 50% of the time, and > > > sleeping for 50% of the time, when did it get blocked/unblocked? > > > > > > This will have an impact on the average share for C and skew it, no? > > > > > > Unless I missed something, the average share of C being (3/5 + 1/3) is an > > > impossible state. You need to consider the portion of time when C runs as 1/5, > > > when A is actually not blocked on anything, too. > > > > > > Hmm actually I just re-read your statement below and you just say 3/5 (18/30) > > > is too much. You didn't consider the average. I'll leave the above in hope to > > > help me understand what am I missing and where I went wrong :-) > > > > > > Generally IMHO looking at the average will not help. I think if the share > > > values make sense in each state individually (and I believe they are), that > > > would be enough. AFAICS, B and D are still taking the right amount of time when > > > C inherits the bandwidth. And C by definition will run longer when A is blocked > > > on it for the whole duration of this blocked time. > > > > I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify > > the math, and then taking average of that. I think that's reasonable? > > I'm not sure. This is skewing the results in my view. > > I think the comparison should just be: > > 1) A, B, C, and D are all running and nothing gets blocked at all. Then shares > would be: > > 2/5, 1/5, 1/5, 1/5 > > 2) A is blocked and C; B, C, D are running with no blocked time. Shares would > be: > > - , 1/5, 3/5, 1/5 > > By definition, we want to treat A in (2) as RUNNING because as soon as > C unblocks A we should return to (1). From B and D perspective, their share is > not impacted throughout this transition. Which is AFAIU is what we want to > achieve. > > I think considering the sleeping time and averaging can lead to misleading > results if care is not taken. Yes, but that doesn't mean we can just ignore it. It is easy in my view to skew the inherited weight to a very large number, only to find that tasks unrelated to the lock acquire/release are "suffering" though they had nothing to do with the lock or the PI. But it is reasonable to try the simple approach first and see the impact. I also never said the averaging approach or consideration of sleeping time is perfect ;-) > Anyway - just trying to explain how I see it and why C is unlikely to be taking > too much time. I could be wrong. As Youssef said, I think there's no > fundamental problem here. I know on Android where they use smaller HZ, the large tick causes lots of problems for large nice deltas. Example if a highly niced task was to be preempted for 1ms, and preempts instead at 3ms, then the less-niced task will not be so nice (even less nice than it promised to be) any more because of the 2ms boost that the higher niced task got. This can lead the the sched_latency thrown out of the window. Not adjusting the weights properly can potentially make that problem much worse IMO. Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-06 13:53 ` Joel Fernandes @ 2022-10-06 19:40 ` Youssef Esmat 2022-10-08 15:04 ` Joel Fernandes 0 siblings, 1 reply; 22+ messages in thread From: Youssef Esmat @ 2022-10-06 19:40 UTC (permalink / raw) To: Joel Fernandes Cc: Qais Yousef, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On Thu, Oct 6, 2022 at 8:53 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Wed, Oct 5, 2022 at 6:04 AM Qais Yousef <qais.yousef@arm.com> wrote: > > > > Hi Joel > > > > On 10/04/22 16:27, Joel Fernandes wrote: > > > > [...] > > > > > I am treating the following the same: > > > > > > a. when A is running, it would be as above. > > > b. but if A was sleeping, B, C, and D would get 1/3. > > > > > > similar to > > > > > > a. when A is running *and blocked on C for all its runtime* > > > ^^ -- in this case, B and D should not have their distributions > > > changed at all because they are not participating in the > > > lock acquire and release. So they should neither be hurt > > > any more, nor be boosted. They should simply stay same [1] > > > > > > b. but if A was sleeping, B, C, and D would get 1/3. > > > > > > > > > [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio. > > > > > > If all are running 100% and A does not block on C, B is blocked by A > > > indefinitely. So the prio of A and B are inverted. We seek to rectify this, that > > > is we need make changes such that, B is returned back to the blocked state. We > > > do this by boosting C. > > > > > > In other words, the prio inheritance will cause B's distribution to not be > > > changed (it was supposed to be blocked before and it is now going to be blocked > > > state again). > > > > > > CFS should not behave any differently, B's distribution should not be changed > > > before/after the priority inhertiance of A by C. That's just my opinion - and > > > that's how I calculated to distribution. With that mind, could you go back to > > > seeing if my math was originally correct or did I mess something up? > > > > It's not about the math. But I think the before and after can't be the same for > > C.. > > C is acquiring/releasing the lock so I expect its distribution to > change. I was talking about the poor B who has nothing to do with the > lock. > > > > > I don't think this is valid. If A is blocked on C for 50% of the time, and > > > > sleeping for 50% of the time, when did it get blocked/unblocked? > > > > > > > > This will have an impact on the average share for C and skew it, no? > > > > > > > > Unless I missed something, the average share of C being (3/5 + 1/3) is an > > > > impossible state. You need to consider the portion of time when C runs as 1/5, > > > > when A is actually not blocked on anything, too. > > > > > > > > Hmm actually I just re-read your statement below and you just say 3/5 (18/30) > > > > is too much. You didn't consider the average. I'll leave the above in hope to > > > > help me understand what am I missing and where I went wrong :-) > > > > > > > > Generally IMHO looking at the average will not help. I think if the share > > > > values make sense in each state individually (and I believe they are), that > > > > would be enough. AFAICS, B and D are still taking the right amount of time when > > > > C inherits the bandwidth. And C by definition will run longer when A is blocked > > > > on it for the whole duration of this blocked time. > > > > > > I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify > > > the math, and then taking average of that. I think that's reasonable? > > > > I'm not sure. This is skewing the results in my view. > > > > I think the comparison should just be: > > > > 1) A, B, C, and D are all running and nothing gets blocked at all. Then shares > > would be: > > > > 2/5, 1/5, 1/5, 1/5 > > > > 2) A is blocked and C; B, C, D are running with no blocked time. Shares would > > be: > > > > - , 1/5, 3/5, 1/5 > > > > By definition, we want to treat A in (2) as RUNNING because as soon as > > C unblocks A we should return to (1). From B and D perspective, their share is > > not impacted throughout this transition. Which is AFAIU is what we want to > > achieve. > > > > I think considering the sleeping time and averaging can lead to misleading > > results if care is not taken. > > Yes, but that doesn't mean we can just ignore it. It is easy in my > view to skew the inherited weight to a very large number, only to find > that tasks unrelated to the lock acquire/release are "suffering" > though they had nothing to do with the lock or the PI. But it is > reasonable to try the simple approach first and see the impact. > > I also never said the averaging approach or consideration of sleeping > time is perfect ;-) > > > Anyway - just trying to explain how I see it and why C is unlikely to be taking > > too much time. I could be wrong. As Youssef said, I think there's no > > fundamental problem here. > > I know on Android where they use smaller HZ, the large tick causes > lots of problems for large nice deltas. Example if a highly niced task > was to be preempted for 1ms, and preempts instead at 3ms, then the > less-niced task will not be so nice (even less nice than it promised > to be) any more because of the 2ms boost that the higher niced task > got. This can lead the the sched_latency thrown out of the window. Not > adjusting the weights properly can potentially make that problem much > worse IMO. Once C releases the lock it should get adjusted and A will get adjusted also regardless of tick. At the point we adjust the weights we have a chance to check for preemption and cause a reschedule. If C doesn't release the lock quickly (hopefully rare), it should continue to run at the adjusted weight since it is still blocking A. > > Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-06 19:40 ` Youssef Esmat @ 2022-10-08 15:04 ` Joel Fernandes 2022-10-10 14:46 ` Qais Yousef 0 siblings, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-10-08 15:04 UTC (permalink / raw) To: Youssef Esmat Cc: Qais Yousef, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <youssefesmat@google.com> wrote: > [..] >> >>> Anyway - just trying to explain how I see it and why C is unlikely to be taking >>> too much time. I could be wrong. As Youssef said, I think there's no >>> fundamental problem here. >> >> I know on Android where they use smaller HZ, the large tick causes >> lots of problems for large nice deltas. Example if a highly niced task >> was to be preempted for 1ms, and preempts instead at 3ms, then the >> less-niced task will not be so nice (even less nice than it promised >> to be) any more because of the 2ms boost that the higher niced task >> got. This can lead the the sched_latency thrown out of the window. Not >> adjusting the weights properly can potentially make that problem much >> worse IMO. > > Once C releases the lock it should get adjusted and A will get > adjusted also regardless of tick. At the point we adjust the weights > we have a chance to check for preemption and cause a reschedule. Yes but the lock can be held for potentially long time (and even user space lock). I’m more comfortable with Peter’s PE patch which seems a more generic solution, than sum of weights if we can get it working. I’m studying Connor’s patch set now… > If C doesn't release the lock quickly (hopefully rare), it should > continue to run at the adjusted weight since it is still blocking A. We can’t depend on rare. Even one bad instance is a fail. So if lock holder and releaser go crazy, we can’t destabilize the system. After all, this is CFS and fairness should not be broken, even if rarely. Thanks. > >> >> Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-08 15:04 ` Joel Fernandes @ 2022-10-10 14:46 ` Qais Yousef 2022-10-10 15:11 ` Joel Fernandes 0 siblings, 1 reply; 22+ messages in thread From: Qais Yousef @ 2022-10-10 14:46 UTC (permalink / raw) To: Joel Fernandes Cc: Youssef Esmat, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 10/08/22 11:04, Joel Fernandes wrote: > > > > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <youssefesmat@google.com> wrote: > > > [..] > >> > >>> Anyway - just trying to explain how I see it and why C is unlikely to be > >>> taking too much time. I could be wrong. As Youssef said, I think there's > >>> no fundamental problem here. > >> > >> I know on Android where they use smaller HZ, the large tick causes lots of > >> problems for large nice deltas. Example if a highly niced task was to be > >> preempted for 1ms, and preempts instead at 3ms, then the less-niced task > >> will not be so nice (even less nice than it promised to be) any more > >> because of the 2ms boost that the higher niced task got. This can lead the > >> the sched_latency thrown out of the window. Not adjusting the weights > >> properly can potentially make that problem much worse IMO. > > > > Once C releases the lock it should get adjusted and A will get adjusted > > also regardless of tick. At the point we adjust the weights we have > > a chance to check for preemption and cause a reschedule. > > Yes but the lock can be held for potentially long time (and even user space > lock). I’m more comfortable with Peter’s PE patch which seems a more generic > solution, than sum of weights if we can get it working. I’m studying Connor’s > patch set now… The 2 solutions are equivalent AFAICT. With summation: A , B , C , D sleeping, running, running, running - , 1/5 , 3/5 , 1/5 Where we'll treat A as running but donate its bandwidth to C, the mutex owner. With PE: A , B , C , D running, running, running, running 2/5 , 1/5 , 1/5 , 1/5 Where A will donate its execution context to C, the mutex owner. In both cases we should end up with the same distribution as if neither A nor C ever go to sleep because of holding the mutex. I still can't see how B and D fairness will be impacted as the solution to the problem is to never treat a waiter as sleeping and let the owner run for more, but only within the limit of what the waiter is allowed to run for. AFAICS, both solutions maintain this relationship. Thanks -- Qais Yousef ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-10 14:46 ` Qais Yousef @ 2022-10-10 15:11 ` Joel Fernandes 2022-10-12 14:30 ` Qais Yousef 0 siblings, 1 reply; 22+ messages in thread From: Joel Fernandes @ 2022-10-10 15:11 UTC (permalink / raw) To: Qais Yousef Cc: Youssef Esmat, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On Mon, Oct 10, 2022 at 10:46 AM Qais Yousef <qais.yousef@arm.com> wrote: > > On 10/08/22 11:04, Joel Fernandes wrote: > > > > > > > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <youssefesmat@google.com> wrote: > > > > > [..] > > >> > > >>> Anyway - just trying to explain how I see it and why C is unlikely to be > > >>> taking too much time. I could be wrong. As Youssef said, I think there's > > >>> no fundamental problem here. > > >> > > >> I know on Android where they use smaller HZ, the large tick causes lots of > > >> problems for large nice deltas. Example if a highly niced task was to be > > >> preempted for 1ms, and preempts instead at 3ms, then the less-niced task > > >> will not be so nice (even less nice than it promised to be) any more > > >> because of the 2ms boost that the higher niced task got. This can lead the > > >> the sched_latency thrown out of the window. Not adjusting the weights > > >> properly can potentially make that problem much worse IMO. > > > > > > Once C releases the lock it should get adjusted and A will get adjusted > > > also regardless of tick. At the point we adjust the weights we have > > > a chance to check for preemption and cause a reschedule. > > > > Yes but the lock can be held for potentially long time (and even user space > > lock). I’m more comfortable with Peter’s PE patch which seems a more generic > > solution, than sum of weights if we can get it working. I’m studying Connor’s > > patch set now… > > The 2 solutions are equivalent AFAICT. Possibly. Maybe I am talking about a non-issue then, but I had to be careful ;-) Maybe both have the issue I was referring to, or they don't. But in any case, PE seems more organic. > With summation: > > A , B , C , D > sleeping, running, running, running > - , 1/5 , 3/5 , 1/5 > > Where we'll treat A as running but donate its bandwidth to C, the mutex owner. > With PE: > > A , B , C , D > running, running, running, running > 2/5 , 1/5 , 1/5 , 1/5 > > Where A will donate its execution context to C, the mutex owner. Yes. It would also be great if Peter can participate in this thread, if he has time. Not to nitpick but to be more precise in PE terminology, you mean "scheduler context". The "execution context" is not inherited [1] If p1 is selected to run while still blocked, the lock owner p2 can run "on its behalf", inheriting p1's scheduler context. Execution context is not inherited, meaning that e.g. the CPUs where p2 can run are still determined by its own affinity and not p1's. [1] https://lore.kernel.org/all/73859883-78c4-1080-7846-e8d644ad397a@redhat.com/t/#mdf0146cdf78e48fc5cc515c1a34cdc1d596e0ed8 > In both cases we should end up with the same distribution as if neither A nor > C ever go to sleep because of holding the mutex. Hopefully! > I still can't see how B and D fairness will be impacted as the solution to the > problem is to never treat a waiter as sleeping and let the owner run for more, > but only within the limit of what the waiter is allowed to run for. AFAICS, > both solutions maintain this relationship. True! - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-10 15:11 ` Joel Fernandes @ 2022-10-12 14:30 ` Qais Yousef 0 siblings, 0 replies; 22+ messages in thread From: Qais Yousef @ 2022-10-12 14:30 UTC (permalink / raw) To: Joel Fernandes Cc: Youssef Esmat, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, bigeasy, Paul E. McKenney On 10/10/22 11:11, Joel Fernandes wrote: > On Mon, Oct 10, 2022 at 10:46 AM Qais Yousef <qais.yousef@arm.com> wrote: > > > > On 10/08/22 11:04, Joel Fernandes wrote: > > > > > > > > > > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <youssefesmat@google.com> wrote: > > > > > > > [..] > > > >> > > > >>> Anyway - just trying to explain how I see it and why C is unlikely to be > > > >>> taking too much time. I could be wrong. As Youssef said, I think there's > > > >>> no fundamental problem here. > > > >> > > > >> I know on Android where they use smaller HZ, the large tick causes lots of > > > >> problems for large nice deltas. Example if a highly niced task was to be > > > >> preempted for 1ms, and preempts instead at 3ms, then the less-niced task > > > >> will not be so nice (even less nice than it promised to be) any more > > > >> because of the 2ms boost that the higher niced task got. This can lead the > > > >> the sched_latency thrown out of the window. Not adjusting the weights > > > >> properly can potentially make that problem much worse IMO. > > > > > > > > Once C releases the lock it should get adjusted and A will get adjusted > > > > also regardless of tick. At the point we adjust the weights we have > > > > a chance to check for preemption and cause a reschedule. > > > > > > Yes but the lock can be held for potentially long time (and even user space > > > lock). I’m more comfortable with Peter’s PE patch which seems a more generic > > > solution, than sum of weights if we can get it working. I’m studying Connor’s > > > patch set now… > > > > The 2 solutions are equivalent AFAICT. > > Possibly. Maybe I am talking about a non-issue then, but I had to be > careful ;-) Maybe both have the issue I was referring to, or they > don't. But in any case, PE seems more organic. Careful is good! I failed to see the problem, that doesn't mean it doesn't exist :-) > > > With summation: > > > > A , B , C , D > > sleeping, running, running, running > > - , 1/5 , 3/5 , 1/5 > > > > Where we'll treat A as running but donate its bandwidth to C, the mutex owner. > > > With PE: > > > > A , B , C , D > > running, running, running, running > > 2/5 , 1/5 , 1/5 , 1/5 > > > > Where A will donate its execution context to C, the mutex owner. > > Yes. It would also be great if Peter can participate in this thread, > if he has time. Not to nitpick but to be more precise in PE > terminology, you mean "scheduler context". The "execution context" is > not inherited [1] > > If p1 is selected to run while still blocked, the lock owner p2 can > run "on its behalf", inheriting p1's scheduler context. Execution > context is not inherited, meaning that e.g. the CPUs where p2 can run > are still determined by its own affinity and not p1's. Yep sorry got the terminology mixed up :-) Cheers -- Qais Yousef > > [1] https://lore.kernel.org/all/73859883-78c4-1080-7846-e8d644ad397a@redhat.com/t/#mdf0146cdf78e48fc5cc515c1a34cdc1d596e0ed8 > > > In both cases we should end up with the same distribution as if neither A nor > > C ever go to sleep because of holding the mutex. > > Hopefully! > > > I still can't see how B and D fairness will be impacted as the solution to the > > problem is to never treat a waiter as sleeping and let the owner run for more, > > but only within the limit of what the waiter is allowed to run for. AFAICS, > > both solutions maintain this relationship. > > True! > > - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-09-30 17:34 ` Joel Fernandes 2022-10-03 16:14 ` Qais Yousef @ 2022-10-04 11:50 ` Sebastian Andrzej Siewior 2022-10-04 14:43 ` Joel Fernandes 1 sibling, 1 reply; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2022-10-04 11:50 UTC (permalink / raw) To: Joel Fernandes Cc: Qais Yousef, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, Paul E. McKenney On 2022-09-30 13:34:49 [-0400], Joel Fernandes wrote: > In this case, there is no lock involved yet you have a dependency. But I don't > mean to sound depressing, and just because there are cases like this does not > mean we should not solve the lock-based ones. When I looked at Android, I saw > that it uses futex directly from Android Runtime code instead of using pthread. > So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do > in the kernel will JustWork(Tm) ? The easy part is just to replace the lock/unlock functions with FUTEX_LOCK_PI/UNLOCK_PI syscalls. The slightly advanced part is where you use an atomic operation to replace 0 with threads's ID in the lock path to avoid going into the kernel for locking if the lock is not contended. If it is, then you need to use the syscall. … > > Proxy execution seems to be the nice solution to all of these problems, but > > it's a long way away. I'm interested to learn how this inheritance will be > > implemented. And whether there are any userspace conversion issues. i.e: do > > we need to convert all locks to rt-mutex locks? > > I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to > weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds > reasonable to me. Based on my understanding with proxy-execution, all in-kernel locks should be covered. Priority inheritance (PI) works only with FUTEX_LOCK_PI for userpace and rtmutex for the in-kernel locks. Regular FUTEX_LOCK does only wait/wake in userspace so there is no way for the kernel to "help". Ah and for PI to work you need priorities that you can inherit. With two threads running as SCHED_OTHER there will be just "normal" sleep+wake in the kernel. If the blocking thread is SCHED_FIFO then it will inherit its priority to the lock owner. > Other thoughts? > > thanks, > > - Joel Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Sum of weights idea for CFS PI 2022-10-04 11:50 ` Sebastian Andrzej Siewior @ 2022-10-04 14:43 ` Joel Fernandes 0 siblings, 0 replies; 22+ messages in thread From: Joel Fernandes @ 2022-10-04 14:43 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Qais Yousef, Peter Zijlstra, LKML, Steven Rostedt, juri.lelli, vincent.guittot, Youssef Esmat, Dietmar Eggemann, Thomas Gleixner, bristot, clark.williams, Paul E. McKenney On 10/4/2022 7:50 AM, Sebastian Andrzej Siewior wrote: > On 2022-09-30 13:34:49 [-0400], Joel Fernandes wrote: >> In this case, there is no lock involved yet you have a dependency. But I don't >> mean to sound depressing, and just because there are cases like this does not >> mean we should not solve the lock-based ones. When I looked at Android, I saw >> that it uses futex directly from Android Runtime code instead of using pthread. >> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do >> in the kernel will JustWork(Tm) ? > > The easy part is just to replace the lock/unlock functions with > FUTEX_LOCK_PI/UNLOCK_PI syscalls. The slightly advanced part is where > you use an atomic operation to replace 0 with threads's ID in the lock > path to avoid going into the kernel for locking if the lock is not > contended. If it is, then you need to use the syscall. > > … >>> Proxy execution seems to be the nice solution to all of these problems, but >>> it's a long way away. I'm interested to learn how this inheritance will be >>> implemented. And whether there are any userspace conversion issues. i.e: do >>> we need to convert all locks to rt-mutex locks? >> >> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to >> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds >> reasonable to me. > > Based on my understanding with proxy-execution, all in-kernel locks > should be covered. > Priority inheritance (PI) works only with FUTEX_LOCK_PI for userpace and > rtmutex for the in-kernel locks. Regular FUTEX_LOCK does only wait/wake > in userspace so there is no way for the kernel to "help". Ah and for PI > to work you need priorities that you can inherit. With two threads > running as SCHED_OTHER there will be just "normal" sleep+wake in the > kernel. If the blocking thread is SCHED_FIFO then it will inherit its > priority to the lock owner. Hi Sebastian, I agree with your thoughts on this. Yes proxy execution idea should cover this. Basically, any primitive that allows userspace to let the kernel know is a dependency can use this AFAICS, FUTEX_LOCK_PI being a prime example. Perhaps Android's binder being another where A sends a message to C and blocks till C responds. Meanwhile medium prio B blocks C. thanks, - Joel ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2022-10-12 14:30 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-09-29 20:38 Sum of weights idea for CFS PI Joel Fernandes 2022-09-30 13:49 ` Qais Yousef 2022-09-30 15:44 ` Youssef Esmat 2022-09-30 17:42 ` Joel Fernandes 2022-09-30 18:10 ` Youssef Esmat 2022-09-30 18:45 ` Joel Fernandes 2022-09-30 17:34 ` Joel Fernandes 2022-10-03 16:14 ` Qais Yousef 2022-10-03 16:27 ` Joel Fernandes 2022-10-04 16:30 ` Qais Yousef 2022-10-04 19:48 ` Joel Fernandes 2022-10-05 9:31 ` Qais Yousef 2022-10-04 20:27 ` Joel Fernandes 2022-10-05 10:04 ` Qais Yousef 2022-10-06 13:53 ` Joel Fernandes 2022-10-06 19:40 ` Youssef Esmat 2022-10-08 15:04 ` Joel Fernandes 2022-10-10 14:46 ` Qais Yousef 2022-10-10 15:11 ` Joel Fernandes 2022-10-12 14:30 ` Qais Yousef 2022-10-04 11:50 ` Sebastian Andrzej Siewior 2022-10-04 14:43 ` Joel Fernandes
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.