* [PATCH] sched/fair: no sync wakeup from interrupt context
@ 2022-07-11 22:47 Libo Chen
  2022-07-13 16:40 ` Tim Chen
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Libo Chen @ 2022-07-11 22:47 UTC (permalink / raw)
  To: peterz, vincent.guittot, mgorman, tim.c.chen, 21cnbao, dietmar.eggemann
  Cc: linux-kernel, tglx

Barry Song first pointed out that replacing sync wakeup with regular wakeup
seems to reduce overeager wakeup pulling and shows a noticeable performance
improvement. [1]

This patch argues that allowing sync for wakeups from interrupt context
is a bug and fixing it can improve performance even when irq/softirq is
evenly spread out.

For wakeups from ISR, the waking CPU is just the CPU running the ISR and the
so-called waker can be any random task that happens to be running on that CPU
when the interrupt comes in. This is completely different from other wakeups
where the task running on the waking CPU is the actual waker, for example,
when two tasks communicate through a pipe or multiple tasks access the same
critical section. This difference is important because with sync we assume
the waker will get off the runqueue and go to sleep immediately after the
wakeup. The assumption is built into wake_affine() where it discounts the
waker's presence from the runqueue when sync is true. A random waker from
interrupt context bears no relation to the wakee and doesn't usually go to
sleep immediately afterwards unless wakeup granularity is reached. Plus the
scheduler no longer enforces preemption of the waker for sync wakeups as it
did before commit f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature").
Enforcing sync wakeup preemption for wakeups from interrupt context doesn't
seem appropriate either, but at least then sync wakeup would do what it's
supposed to do.
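
For reference, the sync discount lives in wake_affine_weight(); the following
is an abridged sketch of the v5.18 code this patch is based on (simplified,
see kernel/sched/fair.c for the full function):

static int
wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
                   int this_cpu, int prev_cpu, int sync)
{
        s64 this_eff_load, prev_eff_load;

        this_eff_load = cpu_load(cpu_rq(this_cpu));

        if (sync) {
                unsigned long current_load = task_h_load(current);

                /* Assume the waker is about to leave the runqueue ... */
                if (current_load > this_eff_load)
                        return this_cpu;

                /* ... so discount its load from the waking CPU. */
                this_eff_load -= current_load;
        }

        /* load comparison against prev_cpu omitted */
        ...
}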

Add a check so that sync can only be set when in_task() is true. If a wakeup
comes from interrupt context, the sync flag will be 0 because in_task()
returns 0.
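
For context, in_task() is roughly defined as follows in include/linux/preempt.h
(recent kernels):

/* True only when not in NMI, hardirq or softirq-serving context. */
#define in_task()       (!(in_nmi() | in_hardirq() | in_serving_softirq()))

So a wakeup issued from an ISR or from softirq handling (e.g. NET_RX)
evaluates in_task() to 0 and the sync hint is dropped.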

Tested in two scenarios: wakeups from 1) task contexts, expected to show no
performance change; 2) interrupt contexts, expected to show better performance
under low/medium load and no regression under heavy load.

Hackbench is used for scenario 1 and pgbench for scenario 2, both from mmtests,
on a 2-socket Xeon E5-2699v3 box with 256G memory in total, running a 5.18
kernel with SELinux disabled. Scheduler/MM tunables are all default. The
irqbalance daemon is active.

Hackbench (config-scheduler-unbound)
=========
process-pipes:
                Baseline                Patched
Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)

process-sockets:
                Baseline                Patched
Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)

thread-pipes:
                Baseline                Patched
Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)

thread-sockets:
                Baseline                Patched
Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)

Pgbench (config-db-pgbench-timed-ro-small)
=======
                Baseline            Patched
Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)

For hackbench, +-3% fluctuation is normal on this two-socket box, so it's safe
to conclude that there are no performance differences for wakeups from task
context. For pgbench, after many runs, 10~30% gains are very consistent at
lower client counts (< #cores per socket). For higher client counts, both
kernels are very close; +-5% swings are quite common. Also, NET_TX|RX softirq
load does spread out across both NUMA nodes in this test, so there is no need
for any explicit RPS/RFS.

[1]: https://lkml.org/lkml/2021/11/5/234

Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..59b210d2cdb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 {
-	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+	/* Don't set sync for wakeup from irq/soft ctx */
+	int sync = in_task() && (wake_flags & WF_SYNC)
+		   && !(current->flags & PF_EXITING);
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
--
2.31.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
@ 2022-07-13 16:40 ` Tim Chen
       [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
  2022-07-14 11:26 ` Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Tim Chen @ 2022-07-13 16:40 UTC (permalink / raw)
  To: Libo Chen, peterz, vincent.guittot, mgorman, 21cnbao, dietmar.eggemann
  Cc: linux-kernel, tglx

On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
> Barry Song first pointed out that replacing sync wakeup with regular wakeup
> seems to reduce overeager wakeup pulling and shows noticeable performance
> improvement.[1]
> 
> This patch argues that allowing sync for wakeups from interrupt context
> is a bug and fixing it can improve performance even when irq/softirq is
> evenly spread out.
> 
> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
> waker can be any random task that happens to be running on that CPU when the
> interrupt comes in. This is completely different from other wakups where the
> task running on the waking CPU is the actual waker. For example, two tasks
> communicate through a pipe or mutiple tasks access the same critical section,
> etc. This difference is important because with sync we assume the waker will
> get off the runqueue and go to sleep immedately after the wakeup. The
> assumption is built into wake_affine() where it discounts the waker's presence
> from the runqueue when sync is true. The random waker from interrupts bears no
> relation to the wakee and don't usually go to sleep immediately afterwards
> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
> preepmtion of waker for sync wakeup as it used to before
> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
> appropriate too but at least sync wakeup will do what it's supposed to do.

Will there be scenarios where you do want the task being woken up to be pulled
to the CPU where the interrupt happened, because the data that needs to be
accessed is local to the CPU/NUMA node where the interrupt happened?  For
example, interrupts associated with received network packets.  Sync still seems
desirable, at least if there is no task currently running on the CPU where the
interrupt happened.  So maybe we should have some consideration of the load on
the CPU/NUMA node before deciding whether we should do a sync wake for such an
interrupt.
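
Purely as an illustration of that suggestion (not code from this thread; the
helper name is made up), such a gate might look roughly like:

/* Illustrative sketch only: honour the sync hint based on waking-CPU state. */
static int wake_sync_allowed(int this_cpu, int wake_flags)
{
        if (!(wake_flags & WF_SYNC))
                return 0;

        if (in_task())
                return 1;       /* ordinary task-context sync wakeup */

        /* interrupt context: only treat as sync when nothing else runs here */
        return available_idle_cpu(this_cpu);
}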

> 
> Add a check to make sure that sync can only be set when in_task() is true. If
> a wakeup is from interrupt context, sync flag will be 0 because in_task()
> returns 0.
> 
> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
> performance changes; 2) interrupt contexts, expected to see better performance
> under low/medium load and no regression under heavy load.
> 
> Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
> a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
> with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
> daemon is active.
> 
> Hackbench (config-scheduler-unbound)
> =========
> process-pipes:
>                 Baseline                Patched
> Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
> Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
> Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
> Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
> Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
> Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
> Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
> Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
> Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
> Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
> Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
> Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
> Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
> Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
> Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)
> 
> process-sockets:
>                 Baseline                Patched
> Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
> Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
> Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
> Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
> Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
> Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
> Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
> Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
> Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
> Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
> Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
> Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
> Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
> Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
> Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)
> 
> thread-pipes:
>                 Baseline                Patched
> Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
> Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
> Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
> Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
> Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
> Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
> Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
> Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
> Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
> Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
> Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
> Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
> Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
> Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
> Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)
> 
> thread-sockets:
>                 Baseline                Patched
> Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
> Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
> Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
> Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
> Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
> Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
> Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
> Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
> Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
> Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
> Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
> Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
> Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
> Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
> Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)
> 
> Pgbench (config-db-pgbench-timed-ro-small)
> =======
>                 Baseline            Patched
> Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
> Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
> Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
> Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
> Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
> Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
> Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
> Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
> Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
> Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)
> 
> For hackbench, +-3% fluctuation is normal on this two-socket box, it's safe to
> conclude that there are no performance differences for wakeups from task context.
> For pgbench, after many runs, 10~30% gains are very consistent at lower
> client counts (< #cores per socket). 

Can you provide some further insight into why pgbench is helped in the low
load case?  Is it because the woken tasks tend to stay put and not get moved
around by interrupts, and so maintain cache warmth?

Tim

> For higher client counts, both kernels are
> very close, +-5% swings are quite common. Also NET_TX|RX softirq load
> does spread out across both NUMA nodes in this test so there is no need to do
> any explicit RPS/RFS.
> 
> [1]: https://lkml.org/lkml/2021/11/5/234
> 
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
>  kernel/sched/fair.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..59b210d2cdb5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  {
> -	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> +	/* Don't set sync for wakeup from irq/soft ctx */
> +	int sync = in_task() && (wake_flags & WF_SYNC)
> +		   && !(current->flags & PF_EXITING);
>  	struct sched_domain *tmp, *sd = NULL;
>  	int cpu = smp_processor_id();
>  	int new_cpu = prev_cpu;
> --
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
       [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
@ 2022-07-13 19:34     ` Libo Chen
  2022-07-13 20:51     ` Tim Chen
  2022-07-14 14:18     ` Mel Gorman
  2 siblings, 0 replies; 21+ messages in thread
From: Libo Chen @ 2022-07-13 19:34 UTC (permalink / raw)
  To: Tim Chen, peterz, vincent.guittot, mgorman, 21cnbao, dietmar.eggemann
  Cc: linux-kernel, tglx

linux-kernel@vger.kernel.org complains about the HTML subpart... I am
resending this in plain text only so everybody on the mailing list can see it.

On 7/13/22 12:17, Libo Chen wrote:
>
>
> On 7/13/22 09:40, Tim Chen wrote:
>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>> seems to reduce overeager wakeup pulling and shows noticeable performance
>>> improvement.[1]
>>>
>>> This patch argues that allowing sync for wakeups from interrupt context
>>> is a bug and fixing it can improve performance even when irq/softirq is
>>> evenly spread out.
>>>
>>> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
>>> waker can be any random task that happens to be running on that CPU when the
>>> interrupt comes in. This is completely different from other wakups where the
>>> task running on the waking CPU is the actual waker. For example, two tasks
>>> communicate through a pipe or mutiple tasks access the same critical section,
>>> etc. This difference is important because with sync we assume the waker will
>>> get off the runqueue and go to sleep immedately after the wakeup. The
>>> assumption is built into wake_affine() where it discounts the waker's presence
>>> from the runqueue when sync is true. The random waker from interrupts bears no
>>> relation to the wakee and don't usually go to sleep immediately afterwards
>>> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
>>> preepmtion of waker for sync wakeup as it used to before
>>> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
>>> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
>>> appropriate too but at least sync wakeup will do what it's supposed to do.
>> Will there be scenarios where you do want the task being woken up be pulled
>> to the CPU where the interrupt happened, as the data that needs to be accessed is
>> on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
>> packets received.  Sync still seems desirable, at least if there is no task currently
>> running on the CPU where interrupt happened.  So maybe we should have some consideration
>> of the load on the CPU/NUMA before deciding whether we should do sync wake for such
>> interrupt.
>>
> There are only two places where sync wakeup matters: 
> wake_affine_idle() and wake_affine_weight().
> In wake_affine_idle(), it considers pulling if there is one runnable 
> on the waking CPU because
> of the belief that this runnable will voluntarily get off the 
> runqueue. In wake_affine_weight(),
> it basically takes off the waker's load again assuming the waker goes 
> to sleep after the wakeup.
> My argument is that this assumption doesn't really hold for wakeups 
> from the interrupt contexts
> when the waking CPU is non-idle. Wakeups from task context? sure, it 
> seems to be a reasonable
> assumption. For your idle case, I totally agree but I don't think 
> having sync or not will actually
> have any impacts here giving what the code does. Real impact comes 
> fromMel's patch 7332dec055f2457c3
> which makes it less likely to pull tasks when the waking CPU is idle. 
> I believe we should consider
> reverting 7332dec055f2 because a significant RDS latency regression 
> has been spotted recently on our
> system due to this patch.
>
>>> Add a check to make sure that sync can only be set when in_task() is true. If
>>> a wakeup is from interrupt context, sync flag will be 0 because in_task()
>>> returns 0.
>>>
>>> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
>>> performance changes; 2) interrupt contexts, expected to see better performance
>>> under low/medium load and no regression under heavy load.
>>>
>>> Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
>>> a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
>>> with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
>>> daemon is active.
>>>
>>> Hackbench (config-scheduler-unbound)
>>> =========
>>> process-pipes:
>>>                  Baseline                Patched
>>> Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
>>> Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
>>> Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
>>> Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
>>> Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
>>> Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
>>> Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
>>> Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
>>> Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
>>> Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
>>> Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
>>> Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
>>> Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
>>> Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
>>> Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)
>>>
>>> process-sockets:
>>>                  Baseline                Patched
>>> Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
>>> Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
>>> Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
>>> Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
>>> Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
>>> Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
>>> Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
>>> Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
>>> Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
>>> Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
>>> Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
>>> Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
>>> Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
>>> Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
>>> Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)
>>>
>>> thread-pipes:
>>>                  Baseline                Patched
>>> Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
>>> Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
>>> Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
>>> Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
>>> Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
>>> Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
>>> Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
>>> Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
>>> Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
>>> Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
>>> Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
>>> Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
>>> Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
>>> Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
>>> Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)
>>>
>>> thread-sockets:
>>>                  Baseline                Patched
>>> Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
>>> Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
>>> Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
>>> Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
>>> Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
>>> Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
>>> Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
>>> Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
>>> Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
>>> Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
>>> Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
>>> Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
>>> Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
>>> Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
>>> Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)
>>>
>>> Pgbench (config-db-pgbench-timed-ro-small)
>>> =======
>>>                  Baseline            Patched
>>> Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
>>> Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
>>> Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
>>> Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
>>> Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
>>> Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
>>> Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
>>> Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
>>> Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
>>> Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)
>>>
>>> For hackbench, +-3% fluctuation is normal on this two-socket box, it's safe to
>>> conclude that there are no performance differences for wakeups from task context.
>>> For pgbench, after many runs, 10~30% gains are very consistent at lower
>>> client counts (< #cores per socket).
>> Can you provide some further insights on why pgebench is helped at low load
>> case?  Is it because the woken tasks tend to stay put and not get moved around with interrupts
>> and maintain cache warmth?
> Yes, and for read-only database workloads, the cache (whether it's 
> incoming packet or not) on the interrupt
> CPU isn't as performance critical as cache from its previous CPU where 
> the db task run to process data.
> To give you an example, consider a db client/server case, a client 
> sends a request for a select query
> through the network, the server accepts the query request and does all 
> the heavy lifting and sends the result
> back. For the server, the incoming packet is just a line of query 
> whereas the CPU and its L3 db process previously
> on has all the warm db caches, pulling it away from them is a crime 
> :)  This may seem to be a little contradictory
> to what I said earlier about the idle case and Mel's patch, but 
> ¯\_(ツ)_/¯ it's hard to make all the workloads out
> there happy. Anyway like I said earlier, this patch doesn't affect the 
> idle case
>
> At higher load, sync in wake_affine_idle() doesn't really matter 
> because the waking CPU could easily have more than
> 1 runnable tasks. Sync in wake_affine_weight() also doesn't matter 
> much as both sides have work to do, and cache
> benefit of not pulling decreases simply because there are a lot more 
> db processes under the same L3, they can compete
> for the same cachelines.
>
> Hope my explanation helps!
>
> Libo
>> Tim
>>
>>> For higher client counts, both kernels are
>>> very close, +-5% swings are quite common. Also NET_TX|RX softirq load
>>> does spread out across both NUMA nodes in this test so there is no need to do
>>> any explicit RPS/RFS.
>>>
>>> [1]: https://lkml.org/lkml/2021/11/5/234
>>>
>>> Signed-off-by: Libo Chen<libo.chen@oracle.com>
>>> ---
>>>   kernel/sched/fair.c | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 794c2cb945f8..59b210d2cdb5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>>   static int
>>>   select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>   {
>>> -	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>>> +	/* Don't set sync for wakeup from irq/soft ctx */
>>> +	int sync = in_task() && (wake_flags & WF_SYNC)
>>> +		   && !(current->flags & PF_EXITING);
>>>   	struct sched_domain *tmp, *sd = NULL;
>>>   	int cpu = smp_processor_id();
>>>   	int new_cpu = prev_cpu;
>>> --
>>> 2.31.1
>>>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
       [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
  2022-07-13 19:34     ` Libo Chen
@ 2022-07-13 20:51     ` Tim Chen
  2022-07-13 21:37       ` Libo Chen
  2022-07-14 14:18     ` Mel Gorman
  2 siblings, 1 reply; 21+ messages in thread
From: Tim Chen @ 2022-07-13 20:51 UTC (permalink / raw)
  To: Libo Chen, peterz, vincent.guittot, mgorman, 21cnbao, dietmar.eggemann
  Cc: linux-kernel, tglx

On Wed, 2022-07-13 at 12:17 -0700, Libo Chen wrote:
> 
> 
> On 7/13/22 09:40, Tim Chen wrote:
> > On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
> > > Barry Song first pointed out that replacing sync wakeup with regular wakeup
> > > seems to reduce overeager wakeup pulling and shows noticeable performance
> > > improvement.[1]
> > > 
> > > This patch argues that allowing sync for wakeups from interrupt context
> > > is a bug and fixing it can improve performance even when irq/softirq is
> > > evenly spread out.
> > > 
> > > For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
> > > waker can be any random task that happens to be running on that CPU when the
> > > interrupt comes in. This is completely different from other wakups where the
> > > task running on the waking CPU is the actual waker. For example, two tasks
> > > communicate through a pipe or mutiple tasks access the same critical section,
> > > etc. This difference is important because with sync we assume the waker will
> > > get off the runqueue and go to sleep immedately after the wakeup. The
> > > assumption is built into wake_affine() where it discounts the waker's presence
> > > from the runqueue when sync is true. The random waker from interrupts bears no
> > > relation to the wakee and don't usually go to sleep immediately afterwards
> > > unless wakeup granularity is reached. Plus the scheduler no longer enforces the
> > > preepmtion of waker for sync wakeup as it used to before
> > > patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> > > wakeup preemption for wakeups from interrupt contexts doesn't seem to be
> > > appropriate too but at least sync wakeup will do what it's supposed to do.
> > 
> > Will there be scenarios where you do want the task being woken up be pulled
> > to the CPU where the interrupt happened, as the data that needs to be accessed is
> > on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
> > packets received.  Sync still seems desirable, at least if there is no task currently
> > running on the CPU where interrupt happened.  So maybe we should have some consideration
> > of the load on the CPU/NUMA before deciding whether we should do sync wake for such
> > interrupt.
> > 
>  There are only two places where sync wakeup matters: wake_affine_idle() and wake_affine_weight().
> In wake_affine_idle(), it considers pulling if there is one runnable on the waking CPU because
> of the belief that this runnable will voluntarily get off the runqueue. In wake_affine_weight(),
> it basically takes off the waker's load again assuming the waker goes to sleep after the wakeup.
> My argument is that this assumption doesn't really hold for wakeups from the interrupt contexts
> when the waking CPU is non-idle. Wakeups from task context? sure, it seems to be a reasonable
> assumption. 

I agree with you that the sync case load computation for wake_affine_idle()
and wake_affine_weight() is incorrect when waking a task from interrupt context.
In this light, your proposal makes sense.
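
For reference, the sync path in wake_affine_idle() being discussed here is
roughly the following (v5.18, abridged; the cache-affine idle check that
precedes it is quoted in full in Mel's reply further down the thread):

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
        /* available_idle_cpu()/cpus_share_cache() check omitted here */

        /*
         * With sync, a single runnable on the waking CPU is assumed to be
         * the waker, about to sleep, so the wakee is pulled to this_cpu.
         */
        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

        if (available_idle_cpu(prev_cpu))
                return prev_cpu;

        return nr_cpumask_bits;
}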

> For your idle case, I totally agree but I don't think having sync or not will actually
> have any impacts here giving what the code does. Real impact comes from Mel's patch 7332dec055f2457c3
> which makes it less likely to pull tasks when the waking CPU is idle. I believe we should consider
> reverting 7332dec055f2 because a significant RDS latency regression has been spotted recently on our
> system due to this patch. 

Commit 7332dec055f2 prevented cross-NUMA-node pulling.  I think if the waking
CPU's NUMA node has a lower average load than the prev CPU's NUMA node, this
cross-node pull could be allowed for better load distribution.

> > 
> > Can you provide some further insights on why pgebench is helped at low load
> > case?  Is it because the woken tasks tend to stay put and not get moved around with interrupts
> > and maintain cache warmth?
>  Yes, and for read-only database workloads, the cache (whether it's incoming packet or not) on the interrupt
> CPU isn't as performance critical as cache from its previous CPU where the db task run to process data.
> To give you an example, consider a db client/server case, a client sends a request for a select query
> through the network, the server accepts the query request and does all the heavy lifting and sends the result
> back. For the server, the incoming packet is just a line of query whereas the CPU and its L3 db process previously
> on has all the warm db caches, pulling it away from them is a crime :)  This may seem to be a little contradictory
> to what I said earlier about the idle case and Mel's patch, but ¯\_(ツ)_/¯ it's hard to make all the workloads out
> there happy. Anyway like I said earlier, this patch doesn't affect the idle case  
> 
> At higher load, sync in wake_affine_idle() doesn't really matter because the waking CPU could easily have more than
> 1 runnable tasks. Sync in wake_affine_weight() also doesn't matter much as both sides have work to do, and cache
> benefit of not pulling decreases simply because there are a lot more db processes under the same L3, they can compete
> for the same cachelines.
> 
> Hope my explanation helps!

Yes, that makes sense.

Tim



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-13 20:51     ` Tim Chen
@ 2022-07-13 21:37       ` Libo Chen
  0 siblings, 0 replies; 21+ messages in thread
From: Libo Chen @ 2022-07-13 21:37 UTC (permalink / raw)
  To: Tim Chen, peterz, vincent.guittot, mgorman, 21cnbao, dietmar.eggemann
  Cc: linux-kernel, tglx



On 7/13/22 13:51, Tim Chen wrote:
> On Wed, 2022-07-13 at 12:17 -0700, Libo Chen wrote:
>>
>> On 7/13/22 09:40, Tim Chen wrote:
>>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>>> seems to reduce overeager wakeup pulling and shows noticeable performance
>>>> improvement.[1]
>>>>
>>>> This patch argues that allowing sync for wakeups from interrupt context
>>>> is a bug and fixing it can improve performance even when irq/softirq is
>>>> evenly spread out.
>>>>
>>>> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
>>>> waker can be any random task that happens to be running on that CPU when the
>>>> interrupt comes in. This is completely different from other wakups where the
>>>> task running on the waking CPU is the actual waker. For example, two tasks
>>>> communicate through a pipe or mutiple tasks access the same critical section,
>>>> etc. This difference is important because with sync we assume the waker will
>>>> get off the runqueue and go to sleep immedately after the wakeup. The
>>>> assumption is built into wake_affine() where it discounts the waker's presence
>>>> from the runqueue when sync is true. The random waker from interrupts bears no
>>>> relation to the wakee and don't usually go to sleep immediately afterwards
>>>> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
>>>> preepmtion of waker for sync wakeup as it used to before
>>>> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
>>>> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
>>>> appropriate too but at least sync wakeup will do what it's supposed to do.
>>> Will there be scenarios where you do want the task being woken up be pulled
>>> to the CPU where the interrupt happened, as the data that needs to be accessed is
>>> on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
>>> packets received.  Sync still seems desirable, at least if there is no task currently
>>> running on the CPU where interrupt happened.  So maybe we should have some consideration
>>> of the load on the CPU/NUMA before deciding whether we should do sync wake for such
>>> interrupt.
>>>
>>   There are only two places where sync wakeup matters: wake_affine_idle() and wake_affine_weight().
>> In wake_affine_idle(), it considers pulling if there is one runnable on the waking CPU because
>> of the belief that this runnable will voluntarily get off the runqueue. In wake_affine_weight(),
>> it basically takes off the waker's load again assuming the waker goes to sleep after the wakeup.
>> My argument is that this assumption doesn't really hold for wakeups from the interrupt contexts
>> when the waking CPU is non-idle. Wakeups from task context? sure, it seems to be a reasonable
>> assumption.
> I agree with you that the the sync case load computation for wake_affine_idle()
> and wake_affine_weight() is incorrect when waking a task from the interrupt context.
> In this light, your proposal makes sense.
>
>> For your idle case, I totally agree but I don't think having sync or not will actually
>> have any impacts here giving what the code does. Real impact comes from Mel's patch 7332dec055f2457c3
>> which makes it less likely to pull tasks when the waking CPU is idle. I believe we should consider
>> reverting 7332dec055f2 because a significant RDS latency regression has been spotted recently on our
>> system due to this patch.
> The commit 7332dec055f2 prevented cross NUMA node pulling.  I think if the
> waking CPU's NUMA node's average load is less than the prev CPU's NUMA node,
> this cross NUMA node pull could be allowed for better load distribution.
Yeah, we should rewrite wake_affine_weight() so that it compares the average
loads of the two nodes instead of the two rq loads.
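
As a purely illustrative sketch of that idea (not a proposal from this thread;
node_avg_load() is a hypothetical helper standing in for whatever node-level
load metric would actually be used):

/* Hypothetical: choose between waking and prev CPU by NUMA-node average load. */
static int wake_affine_node(int this_cpu, int prev_cpu)
{
        unsigned long this_node_load = node_avg_load(cpu_to_node(this_cpu));
        unsigned long prev_node_load = node_avg_load(cpu_to_node(prev_cpu));

        return this_node_load < prev_node_load ? this_cpu : prev_cpu;
}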


Libo

>>> Can you provide some further insights on why pgebench is helped at low load
>>> case?  Is it because the woken tasks tend to stay put and not get moved around with interrupts
>>> and maintain cache warmth?
>>   Yes, and for read-only database workloads, the cache (whether it's incoming packet or not) on the interrupt
>> CPU isn't as performance critical as cache from its previous CPU where the db task run to process data.
>> To give you an example, consider a db client/server case, a client sends a request for a select query
>> through the network, the server accepts the query request and does all the heavy lifting and sends the result
>> back. For the server, the incoming packet is just a line of query whereas the CPU and its L3 db process previously
>> on has all the warm db caches, pulling it away from them is a crime :)  This may seem to be a little contradictory
>> to what I said earlier about the idle case and Mel's patch, but ¯\_(ツ)_/¯ it's hard to make all the workloads out
>> there happy. Anyway like I said earlier, this patch doesn't affect the idle case
>>
>> At higher load, sync in wake_affine_idle() doesn't really matter because the waking CPU could easily have more than
>> 1 runnable tasks. Sync in wake_affine_weight() also doesn't matter much as both sides have work to do, and cache
>> benefit of not pulling decreases simply because there are a lot more db processes under the same L3, they can compete
>> for the same cachelines.
>>
>> Hope my explanation helps!
> Yes, that makes sense.
>
> Tim
>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
  2022-07-13 16:40 ` Tim Chen
@ 2022-07-14 11:26 ` Peter Zijlstra
  2022-07-14 11:35 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2022-07-14 11:26 UTC (permalink / raw)
  To: Libo Chen
  Cc: vincent.guittot, mgorman, tim.c.chen, 21cnbao, dietmar.eggemann,
	linux-kernel, tglx

On Mon, Jul 11, 2022 at 03:47:04PM -0700, Libo Chen wrote:
> Barry Song first pointed out that replacing sync wakeup with regular wakeup
> seems to reduce overeager wakeup pulling and shows noticeable performance
> improvement.[1]
> 
> This patch argues that allowing sync for wakeups from interrupt context
> is a bug 

Yes.

> The
> assumption is built into wake_affine() where it discounts the waker's presence
> from the runqueue when sync is true. The random waker from interrupts bears no
> relation to the wakee and don't usually go to sleep immediately afterwards
> unless wakeup granularity is reached. 

Exactly that.

> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
>  kernel/sched/fair.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..59b210d2cdb5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  {
> -	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> +	/* Don't set sync for wakeup from irq/soft ctx */
> +	int sync = in_task() && (wake_flags & WF_SYNC)
> +		   && !(current->flags & PF_EXITING);

That's not a coding style you'll find anywhere near this code though.
I'll fix it up.

>  	struct sched_domain *tmp, *sd = NULL;
>  	int cpu = smp_processor_id();
>  	int new_cpu = prev_cpu;
> --
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
  2022-07-13 16:40 ` Tim Chen
  2022-07-14 11:26 ` Peter Zijlstra
@ 2022-07-14 11:35 ` Peter Zijlstra
  2022-07-14 18:18   ` Libo Chen
  2022-07-21  8:44 ` [tip: sched/core] sched/fair: Disallow " tip-bot2 for Libo Chen
  2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
  4 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2022-07-14 11:35 UTC (permalink / raw)
  To: Libo Chen
  Cc: vincent.guittot, mgorman, tim.c.chen, 21cnbao, dietmar.eggemann,
	linux-kernel, tglx

On Mon, Jul 11, 2022 at 03:47:04PM -0700, Libo Chen wrote:
> [1]: https://lkml.org/lkml/2021/11/5/234

Please use lore.kernel.org based links that contain the MsgID of the
email in question, I think this wants to be:

https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com

Right?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
       [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
  2022-07-13 19:34     ` Libo Chen
  2022-07-13 20:51     ` Tim Chen
@ 2022-07-14 14:18     ` Mel Gorman
  2022-07-14 20:21       ` Libo Chen
  2 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2022-07-14 14:18 UTC (permalink / raw)
  To: Libo Chen
  Cc: Tim Chen, peterz, vincent.guittot, 21cnbao, dietmar.eggemann,
	linux-kernel, tglx

On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:
> 
> 
> On 7/13/22 09:40, Tim Chen wrote:
> > On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
> > > Barry Song first pointed out that replacing sync wakeup with regular wakeup
> > > seems to reduce overeager wakeup pulling and shows noticeable performance
> > > improvement.[1]
> > > 
> > > This patch argues that allowing sync for wakeups from interrupt context
> > > is a bug and fixing it can improve performance even when irq/softirq is
> > > evenly spread out.
> > > 
> > > For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
> > > waker can be any random task that happens to be running on that CPU when the
> > > interrupt comes in. This is completely different from other wakups where the
> > > task running on the waking CPU is the actual waker. For example, two tasks
> > > communicate through a pipe or mutiple tasks access the same critical section,
> > > etc. This difference is important because with sync we assume the waker will
> > > get off the runqueue and go to sleep immedately after the wakeup. The
> > > assumption is built into wake_affine() where it discounts the waker's presence
> > > from the runqueue when sync is true. The random waker from interrupts bears no
> > > relation to the wakee and don't usually go to sleep immediately afterwards
> > > unless wakeup granularity is reached. Plus the scheduler no longer enforces the
> > > preepmtion of waker for sync wakeup as it used to before
> > > patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> > > wakeup preemption for wakeups from interrupt contexts doesn't seem to be
> > > appropriate too but at least sync wakeup will do what it's supposed to do.
> > Will there be scenarios where you do want the task being woken up be pulled
> > to the CPU where the interrupt happened, as the data that needs to be accessed is
> > on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
> > packets received.  Sync still seems desirable, at least if there is no task currently
> > running on the CPU where interrupt happened.  So maybe we should have some consideration
> > of the load on the CPU/NUMA before deciding whether we should do sync wake for such
> > interrupt.
> > 
> There are only two places where sync wakeup matters: wake_affine_idle() and
> wake_affine_weight().
> In wake_affine_idle(), it considers pulling if there is one runnable on the
> waking CPU because
> of the belief that this runnable will voluntarily get off the runqueue. In
> wake_affine_weight(),
> it basically takes off the waker's load again assuming the waker goes to
> sleep after the wakeup.
> My argument is that this assumption doesn't really hold for wakeups from the
> interrupt contexts
> when the waking CPU is non-idle. Wakeups from task context? sure, it seems
> to be a reasonable
> assumption. For your idle case, I totally agree but I don't think having
> sync or not will actually
> have any impacts here giving what the code does. Real impact comes fromMel's
> patch 7332dec055f2457c3
> which makes it less likely to pull tasks when the waking CPU is idle. I
> believe we should consider
> reverting 7332dec055f2 because a significant RDS latency regression has been
> spotted recently on our
> system due to this patch.
> 

The intent of 7332dec055f2 was to prevent harmful cross-node accesses.
It still allowed cache-local migrations on the assumption that the incoming
data was critical enough to justify losing any other cache-hot data. You
state explicitly that "the interrupt CPU isn't as performance critical as
cache from its previous CPU" so that assumption was incorrect, at least
in your case. I don't have a counter example where the interrupt data *is*
more important than any other cache-hot data so the check can go.

I think a revert would not achieve what you want as a plain revert would
still allow an interrupt to pull a task from an arbitrary location as sync
is not checked. A follow-up to your patch or an updated version should not
check available_idle_cpu at all in wake_affine_idle as it's only idle the
wake is from interrupt context and vcpu_is_preempted is not necessarily
justification for pulling a task due to an interrupt.

Something like this but needs testing with your target loads, particularly
the RDS (Relational Database Service?) latency regression;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7b275672c89..e55a3a67a442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
  * soonest. For the purpose of speed we only consider the waking and previous
  * CPU.
  *
- * wake_affine_idle() - only considers 'now', it check if the waking CPU is
- *			cache-affine and is (or	will be) idle.
+ * wake_affine_idle() - only considers 'now', it checks if the waker task is a
+ *			sync wakeup from a CPU that should be idle soon.
  *
  * wake_affine_weight() - considers the weight to reflect the average
  *			  scheduling latency of the CPUs. This seems to work
@@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
 static int
 wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
-	/*
-	 * If this_cpu is idle, it implies the wakeup is from interrupt
-	 * context. Only allow the move if cache is shared. Otherwise an
-	 * interrupt intensive workload could force all tasks onto one
-	 * node depending on the IO topology or IRQ affinity settings.
-	 *
-	 * If the prev_cpu is idle and cache affine then avoid a migration.
-	 * There is no guarantee that the cache hot data from an interrupt
-	 * is more important than cache hot data on the prev_cpu and from
-	 * a cpufreq perspective, it's better to have higher utilisation
-	 * on one CPU.
-	 */
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-14 11:35 ` Peter Zijlstra
@ 2022-07-14 18:18   ` Libo Chen
  0 siblings, 0 replies; 21+ messages in thread
From: Libo Chen @ 2022-07-14 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vincent.guittot, mgorman, tim.c.chen, 21cnbao, dietmar.eggemann,
	linux-kernel, tglx



On 7/14/22 04:35, Peter Zijlstra wrote:
> On Mon, Jul 11, 2022 at 03:47:04PM -0700, Libo Chen wrote:
>> [1]: https://lkml.org/lkml/2021/11/5/234
> Please use lore.kernel.org based links that contain the MsgID of the
> email in question, I think this wants to be:
>
> https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com
>
> Right?
Oh yes, thanks.

Libo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-14 14:18     ` Mel Gorman
@ 2022-07-14 20:21       ` Libo Chen
  2022-07-15 10:07         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Libo Chen @ 2022-07-14 20:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tim Chen, peterz, vincent.guittot, 21cnbao, dietmar.eggemann,
	linux-kernel, tglx, Konrad Rzeszutek Wilk



On 7/14/22 07:18, Mel Gorman wrote:
> On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:
>>
>> On 7/13/22 09:40, Tim Chen wrote:
>>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>>> seems to reduce overeager wakeup pulling and shows noticeable performance
>>>> improvement.[1]
>>>>
>>>> This patch argues that allowing sync for wakeups from interrupt context
>>>> is a bug and fixing it can improve performance even when irq/softirq is
>>>> evenly spread out.
>>>>
>>>> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
>>>> waker can be any random task that happens to be running on that CPU when the
>>>> interrupt comes in. This is completely different from other wakups where the
>>>> task running on the waking CPU is the actual waker. For example, two tasks
>>>> communicate through a pipe or mutiple tasks access the same critical section,
>>>> etc. This difference is important because with sync we assume the waker will
>>>> get off the runqueue and go to sleep immedately after the wakeup. The
>>>> assumption is built into wake_affine() where it discounts the waker's presence
>>>> from the runqueue when sync is true. The random waker from interrupts bears no
>>>> relation to the wakee and don't usually go to sleep immediately afterwards
>>>> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
>>>> preepmtion of waker for sync wakeup as it used to before
>>>> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
>>>> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
>>>> appropriate too but at least sync wakeup will do what it's supposed to do.
>>> Will there be scenarios where you do want the task being woken up be pulled
>>> to the CPU where the interrupt happened, as the data that needs to be accessed is
>>> on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
>>> packets received.  Sync still seems desirable, at least if there is no task currently
>>> running on the CPU where interrupt happened.  So maybe we should have some consideration
>>> of the load on the CPU/NUMA before deciding whether we should do sync wake for such
>>> interrupt.
>>>
>> There are only two places where sync wakeup matters: wake_affine_idle() and
>> wake_affine_weight().
>> In wake_affine_idle(), it considers pulling if there is one runnable on the
>> waking CPU because
>> of the belief that this runnable will voluntarily get off the runqueue. In
>> wake_affine_weight(),
>> it basically takes off the waker's load again assuming the waker goes to
>> sleep after the wakeup.
>> My argument is that this assumption doesn't really hold for wakeups from the
>> interrupt contexts
>> when the waking CPU is non-idle. Wakeups from task context? sure, it seems
>> to be a reasonable
>> assumption. For your idle case, I totally agree but I don't think having
>> sync or not will actually
>> have any impacts here giving what the code does. Real impact comes fromMel's
>> patch 7332dec055f2457c3
>> which makes it less likely to pull tasks when the waking CPU is idle. I
>> believe we should consider
>> reverting 7332dec055f2 because a significant RDS latency regression has been
>> spotted recently on our
>> system due to this patch.
>>
> The intent of 7332dec055f2 was to prevent harmful cross-node accesses.
> It still allowed cache-local migrations on the assumption that the incoming
> data was critical enough to justify losing any other cache-hot data. You
I am not too against cache-local migrations here because they are still under
the same LLC. We really focus on cross-node migrations, which sometimes hurt
us and sometimes don't, because this is where you need to determine who has
the warmer L3 cache.
> state explicitly that "the interrupt CPU isn't as performance critical as
> cache from its previous CPU" so that assumption was incorrect, at least
> in your case. I don't have a counter example where the interrupt data *is*
> more important than any other cache-hot data so the check can go.
>
> I think a revert would not achieve what you want as a plain revert would
> still allow an interrupt to pull a task from an arbitrary location as sync
This is the tricky part; I didn't explain it well. For rds-stress, it's a lot
(~30%) better to allow pulling across nodes when the waking CPU is idle.
I think this may be an example of the interrupt data being more important.
Something like the below helps a lot for this particular benchmark (rds-stress):

if (available_idle_cpu(this_cpu))
        return this_cpu;

But for DB workloads, yes, in general pulling does more damage than good, as
said before, especially under light load. Still, I think pulling to an idle CPU
across nodes is okay, because this usually happens at the beginning of a
workload, and once a task is pulled it stays there, or at least in the same LLC
sched domain, for a while. What is not okay is when the waking CPU already has
a task and the wakeup still pulls the wakee to that CPU across nodes
irrespective of its previous CPU. And that's what this patch tries to address.

Mel, I am thinking about a follow-up patch like the one below; then we can
continue the discussion there, since this is kind of a separate issue:

-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-

+       if (available_idle_cpu(this_cpu))
+               if (cpus_share_cache(this_cpu, prev_cpu))
+                       return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+       else
+               return this_cpu;
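
For clarity, C binds that "else" to the inner "if", so the snippet above parses
as the following, which appears to match the intent (pull to this_cpu when it
is idle, even across LLCs/nodes); if so, braces would make that explicit:

	if (available_idle_cpu(this_cpu)) {
		if (cpus_share_cache(this_cpu, prev_cpu))
			return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
		else
			return this_cpu; /* idle but on another LLC/node: pull anyway */
	}
	/* otherwise fall through to the existing sync/nr_running check */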

> is not checked. A follow-up to your patch or an updated version should not
> check available_idle_cpu at all in wake_affine_idle as it's only idle if the
> wake is from interrupt context, and vcpu_is_preempted is not necessarily
> justification for pulling a task due to an interrupt.
>
> Something like this but needs testing with your target loads, particularly
> the RDS (Relational Database Service?) latency regression;
Sorry, it's Reliable Datagram Sockets. We run rds-stress to measure RDS
bandwidth and latency.

Libo
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7b275672c89..e55a3a67a442 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
>    * soonest. For the purpose of speed we only consider the waking and previous
>    * CPU.
>    *
> - * wake_affine_idle() - only considers 'now', it check if the waking CPU is
> - *			cache-affine and is (or	will be) idle.
> + * wake_affine_idle() - only considers 'now', it checks if the waker task is a
> + *			sync wakeup from a CPU that should be idle soon.
>    *
>    * wake_affine_weight() - considers the weight to reflect the average
>    *			  scheduling latency of the CPUs. This seems to work
> @@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
>   static int
>   wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>   {
> -	/*
> -	 * If this_cpu is idle, it implies the wakeup is from interrupt
> -	 * context. Only allow the move if cache is shared. Otherwise an
> -	 * interrupt intensive workload could force all tasks onto one
> -	 * node depending on the IO topology or IRQ affinity settings.
> -	 *
> -	 * If the prev_cpu is idle and cache affine then avoid a migration.
> -	 * There is no guarantee that the cache hot data from an interrupt
> -	 * is more important than cache hot data on the prev_cpu and from
> -	 * a cpufreq perspective, it's better to have higher utilisation
> -	 * on one CPU.
> -	 */
> -	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> -
>   	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>   		return this_cpu;
>   
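
For reference, with that hunk removed, wake_affine_idle() would reduce to
roughly the following (a sketch based on the surrounding context, not the
exact tree state):

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/* Only a sync wakeup from a waking CPU whose single runnable
	 * (the presumed waker) is about to sleep justifies pulling. */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	/* Otherwise prefer the previous CPU if it is idle. */
	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	/* No decision here; let wake_affine_weight() choose. */
	return nr_cpumask_bits;
}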


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-14 20:21       ` Libo Chen
@ 2022-07-15 10:07         ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-07-15 10:07 UTC (permalink / raw)
  To: Libo Chen
  Cc: Tim Chen, peterz, vincent.guittot, 21cnbao, dietmar.eggemann,
	linux-kernel, tglx, Konrad Rzeszutek Wilk

On Thu, Jul 14, 2022 at 01:21:14PM -0700, Libo Chen wrote:
> > state explicitly that "the interrupt CPU isn't as performance critical as
> > cache from its previous CPU" so that assumption was incorrect, at least
> > in your case. I don't have a counter example where the interrupt data *is*
> > more important than any other cache-hot data so the check can go.
> > 
> > I think a revert would not achieve what you want as a plain revert would
> > still allow an interrupt to pull a task from an arbitrary location as sync
> This is the tricky part, I didn't explain it well. For rds-stress, it's a
> lot (~30%) better to allow pulling across nodes when the waking CPU is idle.

Ah, the exact opposite then.

> I think this may be an example of interrupt data being more important.
> Something
> like below will help a lot for this particular benchmark (rds-stress):
> 
> if (available_idle_cpu(this_cpu))
>         return this_cpu;
> 

I see, but this will likely regress for workloads that receive interrupts on
arbitrary CPUs that are not related to the task's preferred location. This
can happen for IO completions for example where interrupts can be delivered
round-robin to many CPUs in the system. It's all described in the changelog
for 7332dec055f2

	Unfortunately, depending on the type of interrupt and IRQ
	configuration, there may not be a strong relationship between the
	CPU an interrupt was delivered on and the CPU a task was running
	on. For example, the interrupts could all be delivered to CPUs on
	one particular node due to the machine topology or IRQ affinity
	configuration. Another example is an interrupt for an IO completion
	which can be delivered to any CPU where there is no guarantee the
	data is either cache hot or even local.

> still pulls
> the wakee task to that CPU across nodes irrespective of its previous CPU.
> And that's
> what this patch tries to address.
> 

> Mel, I am thinking about a follow-up patch like below then we can continue
> the discussion
> there since this is kinda a separate issue:
> 
> -	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> -
> 
> +       if (available_idle_cpu(this_cpu))
> +               if (cpus_share_cache(this_cpu, prev_cpu))
> +                       return available_idle_cpu(prev_cpu) ? prev_cpu :
> this_cpu;
> +       else
> +               return this_cpu;
> 

That will also pull tasks cross-node and while it might work well for a
network stress test, it will hurt other cases where the interrupt data
is relatively unimportant to the waking task.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [tip: sched/core] sched/fair: Disallow sync wakeup from interrupt context
  2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
                   ` (2 preceding siblings ...)
  2022-07-14 11:35 ` Peter Zijlstra
@ 2022-07-21  8:44 ` tip-bot2 for Libo Chen
  2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
  4 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Libo Chen @ 2022-07-21  8:44 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Libo Chen, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     14b3f2d9ee8df3b6040f7e21f9fcd1d848938fd9
Gitweb:        https://git.kernel.org/tip/14b3f2d9ee8df3b6040f7e21f9fcd1d848938fd9
Author:        Libo Chen <libo.chen@oracle.com>
AuthorDate:    Mon, 11 Jul 2022 15:47:04 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 21 Jul 2022 10:39:39 +02:00

sched/fair: Disallow sync wakeup from interrupt context

Barry Song first pointed out that replacing sync wakeup with regular wakeup
seems to reduce overeager wakeup pulling and shows noticeable performance
improvement.[1]

This patch argues that allowing sync for wakeups from interrupt context
is a bug and fixing it can improve performance even when irq/softirq is
evenly spread out.

For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
waker can be any random task that happens to be running on that CPU when the
interrupt comes in. This is completely different from other wakups where the
task running on the waking CPU is the actual waker. For example, two tasks
communicate through a pipe or mutiple tasks access the same critical section,
etc. This difference is important because with sync we assume the waker will
get off the runqueue and go to sleep immedately after the wakeup. The
assumption is built into wake_affine() where it discounts the waker's presence
from the runqueue when sync is true. The random waker from interrupts bears no
relation to the wakee and don't usually go to sleep immediately afterwards
unless wakeup granularity is reached. Plus the scheduler no longer enforces the
preepmtion of waker for sync wakeup as it used to before
patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
wakeup preemption for wakeups from interrupt contexts doesn't seem to be
appropriate too but at least sync wakeup will do what it's supposed to do.

Add a check to make sure that sync can only be set when in_task() is true. If
a wakeup is from interrupt context, sync flag will be 0 because in_task()
returns 0.

Tested in two scenarios: wakeups from 1) task contexts, expected to see no
performance changes; 2) interrupt contexts, expected to see better performance
under low/medium load and no regression under heavy load.

Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
daemon is active.

Hackbench (config-scheduler-unbound)
=========
process-pipes:
                Baseline                Patched
Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)

process-sockets:
                Baseline                Patched
Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)

thread-pipes:
                Baseline                Patched
Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)

thread-sockets:
                Baseline                Patched
Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)

Pgbench (config-db-pgbench-timed-ro-small)
=======
                Baseline            Patched
Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)

For hackbench, +-3% fluctuation is normal on this two-socket box, so it is safe
to conclude that there are no performance differences for wakeups from task
context. For pgbench, after many runs, 10~30% gains are very consistent at lower
client counts (< #cores per socket). For higher client counts, both kernels are
very close; +-5% swings are quite common. Also, NET_TX|RX softirq load does
spread out across both NUMA nodes in this test, so there is no need to do any
explicit RPS/RFS.

[1]: https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com

Signed-off-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220711224704.1672831-1-libo.chen@oracle.com
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c..2fc4725 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7013,7 +7013,10 @@ unlock:
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 {
-	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+	/*
+	 * Only consider WF_SYNC from task context; since only a task can schedule out.
+	 */
+	int sync = in_task() && (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
                   ` (3 preceding siblings ...)
  2022-07-21  8:44 ` [tip: sched/core] sched/fair: Disallow " tip-bot2 for Libo Chen
@ 2022-07-29  4:47 ` K Prateek Nayak
  2022-08-01 13:26   ` Ingo Molnar
  2022-08-01 14:57   ` Libo Chen
  4 siblings, 2 replies; 21+ messages in thread
From: K Prateek Nayak @ 2022-07-29  4:47 UTC (permalink / raw)
  To: Libo Chen, peterz, vincent.guittot, mgorman, tim.c.chen, 21cnbao,
	dietmar.eggemann
  Cc: linux-kernel, tglx

Hello Libo and Peter,

tl;dr

- We observed a major regression with tbench when testing the latest tip
  sched/core at:
  commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
  The reason for the regression is the reduced number of affine wakeups, which
  leaves the client farther away from the data it needs to consume next, primed
  in the waker's LLC.
  Such regressions can be expected from tasks that use sockets to communicate a
  significant amount of data, especially on systems with multiple LLCs.

- Other benchmarks show comparable behavior to the tip at the previous
  commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
  into core tree when update cookie"

I'll leave more details below.

On 7/12/2022 4:17 AM, Libo Chen wrote:
> Barry Song first pointed out that replacing sync wakeup with regular wakeup
> seems to reduce overeager wakeup pulling and shows noticeable performance
> improvement.[1]
> 
> This patch argues that allowing sync for wakeups from interrupt context
> is a bug and fixing it can improve performance even when irq/softirq is
> evenly spread out.
> 
> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
> waker can be any random task that happens to be running on that CPU when the
> interrupt comes in. This is completely different from other wakups where the
> task running on the waking CPU is the actual waker. For example, two tasks
> communicate through a pipe or mutiple tasks access the same critical section,
> etc. This difference is important because with sync we assume the waker will
> get off the runqueue and go to sleep immedately after the wakeup. The
> assumption is built into wake_affine() where it discounts the waker's presence
> from the runqueue when sync is true. The random waker from interrupts bears no
> relation to the wakee and don't usually go to sleep immediately afterwards
> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
> preepmtion of waker for sync wakeup as it used to before
> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
> appropriate too but at least sync wakeup will do what it's supposed to do.
> 
> Add a check to make sure that sync can only be set when in_task() is true. If
> a wakeup is from interrupt context, sync flag will be 0 because in_task()
> returns 0.
> 
> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
> performance changes; 2) interrupt contexts, expected to see better performance
> under low/medium load and no regression under heavy load.
> 
> Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
> a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
> with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
> daemon is active.
> 
> Hackbench (config-scheduler-unbound)
> =========
> process-pipes:
>                 Baseline                Patched
> Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
> Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
> Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
> Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
> Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
> Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
> Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
> Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
> Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
> Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
> Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
> Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
> Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
> Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
> Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)
> 
> process-sockets:
>                 Baseline                Patched
> Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
> Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
> Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
> Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
> Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
> Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
> Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
> Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
> Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
> Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
> Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
> Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
> Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
> Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
> Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)
> 
> thread-pipes:
>                 Baseline                Patched
> Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
> Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
> Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
> Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
> Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
> Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
> Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
> Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
> Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
> Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
> Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
> Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
> Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
> Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
> Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)
> 
> thread-sockets:
>                 Baseline                Patched
> Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
> Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
> Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
> Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
> Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
> Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
> Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
> Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
> Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
> Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
> Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
> Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
> Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
> Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
> Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)
> 
> Pgbench (config-db-pgbench-timed-ro-small)
> =======
>                 Baseline            Patched
> Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
> Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
> Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
> Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
> Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
> Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
> Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
> Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
> Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
> Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)

The two test kernels used are:

- tip at commit: 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
- tip at commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue into core tree when update cookie"

Below are the tbench results on a dual socket Zen3 machine
running in NPS1 mode. Following is the NUMA configuration in
NPS1 mode:

- NPS1: Each socket is a NUMA node.
  Total 2 NUMA nodes in the dual socket machine.

  Node 0: 0-63,   128-191
  Node 1: 64-127, 192-255

Clients: tip (91caa5ae2424)      tip (14b3f2d9ee8d)
    1    569.24 (0.00 pct)       283.63 (-50.17 pct)    *
    2    1104.76 (0.00 pct)      590.45 (-46.55 pct)    *
    4    2058.78 (0.00 pct)      1080.63 (-47.51 pct)   *
    8    3590.20 (0.00 pct)      2098.05 (-41.56 pct)   *
   16    6119.21 (0.00 pct)      4348.40 (-28.93 pct)   *
   32    11383.91 (0.00 pct)     8417.55 (-26.05 pct)   *
   64    21910.01 (0.00 pct)     19405.11 (-11.43 pct)  *
  128    33105.27 (0.00 pct)     29791.80 (-10.00 pct)  *
  256    45550.99 (0.00 pct)     45847.10 (0.65 pct)
  512    57753.81 (0.00 pct)     49481.17 (-14.32 pct)  *
 1024    55684.33 (0.00 pct)     48748.38 (-12.45 pct)  *

This regression is consistently reproducible.

Below are the statistics gathered from schedstat data:

Kernel                                                     :        tip + remove 14b3f2d9ee8d                    tip
HEAD commit                                                :             91caa5ae2424                       14b3f2d9ee8d                                                                
sched_yield cnt                                            :                   11                                 17
Legacy counter can be ignored                              :                    0                                  0
schedule called                                            :             12621212                           15104022
schedule left the processor idle                           :              6306653 ( 49.96% of times )        7547375       ( 49.96% of times )
try_to_wake_up was called                                  :              6310778                            7552147
try_to_wake_up was called to wake up the local cpu         :                12305 ( 0.19% of times )           12816       ( 0.16% of times ) 
total time by tasks on this processor (in jiffies)         :          78497712520                        72461915902  
total time waiting tasks on this processor (in jiffies)    :             56398803 ( 0.07% of times )        34631178       ( 0.04% of times ) 
total timeslices run on this cpu                           :              6314548                            7556630   

Wakeups on same                                    SMT     :                   39 ( 0.00062 )                    263 ( 0.00348 )
Wakeups on same                                    MC      :              6297015 ( 99.78% of time ) <---       1079 ( 0.01429 )
Wakeups on same                                    DIE     :                  559 ( 0.00886 )                7537909 ( 99.81147 ) <--- With the patch, the task will prefer
Wakeups on same                                    NUMA    :                  860 ( 0.01363 )                     80 ( 0.00106 )       to wake on the same LLC where it previously
Affine wakeups on same                             SMT     :                   25 ( 0.00040 )                    255 ( 0.00338 )       ran as compared to the LLC of waker.
Affine wakeups on same                             MC      :              6282684 ( 99.55% of time ) <---        961 ( 0.01272 )       This results in performance degradation as
Affine wakeups on same                             DIE     :                  523 ( 0.00829 )                7537220 ( 99.80235 ) <--- the task is farther away from data it will
Affine wakeups on same                             NUMA    :                  839 ( 0.01329 )                     46 ( 0.00061 )       consume next.

All the statistics are comparable except for the reduced number of affine
wakeups onto the waker's LLC, which results in the task being placed on its
previous LLC, farther away from the data residing in the waker's LLC that the
wakee will consume next.

All wakeups in tbench happen in_serving_softirq(), making in_task() false in
all the cases where sync would otherwise have been true.
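
For context, in_task() boils down to a preempt_count() check; paraphrasing
include/linux/preempt.h (the exact spelling varies across kernel versions):

	/* True only when no NMI, hard-IRQ or softirq-serving bits are set
	 * in preempt_count(). */
	#define in_task()	(!(in_nmi() | in_hardirq() | in_serving_softirq()))

So with the patch applied, a wakeup issued from the NET_RX softirq path can
never be a sync wakeup.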

We wanted to highlight that there are workloads which would still benefit from
affine wakeups even when the wakeup happens in interrupt context. It would be
great if we could spot such cases and bias wakeups towards the waker's LLC.

Other benchmark results are comparable to the tip in most cases.
All benchmarks were run on machine configured in NPS1 mode.
Following are the results:

~~~~~~~~~
hackbench
~~~~~~~~~

Test:             tip (91caa5ae2424)      tip (14b3f2d9ee8d)
 1-groups:         4.22 (0.00 pct)         4.48 (-6.16 pct)     *
 1-groups:         4.22 (0.00 pct)         4.30 (-1.89 pct)     [Verification run]
 2-groups:         5.01 (0.00 pct)         4.87 (2.79 pct)
 4-groups:         5.49 (0.00 pct)         5.34 (2.73 pct)
 8-groups:         5.64 (0.00 pct)         5.50 (2.48 pct)
16-groups:         7.54 (0.00 pct)         7.34 (2.65 pct)

~~~~~~~~
schbench
~~~~~~~~

#workers: tip (91caa5ae2424)     tip (14b3f2d9ee8d)
  1:      22.00 (0.00 pct)        22.00 (0.00 pct)
  2:      22.00 (0.00 pct)        27.00 (-22.72 pct)    * Known to have run to run
  4:      33.00 (0.00 pct)        38.00 (-15.15 pct)    * variations.
  8:      48.00 (0.00 pct)        51.00 (-6.25 pct)     *
 16:      70.00 (0.00 pct)        70.00 (0.00 pct)
 32:     118.00 (0.00 pct)       120.00 (-1.69 pct)
 64:     217.00 (0.00 pct)       223.00 (-2.76 pct)
128:     485.00 (0.00 pct)       488.00 (-0.61 pct)
256:     1066.00 (0.00 pct)      1054.00 (1.12 pct)
512:     47296.00 (0.00 pct)     47168.00 (0.27 pct)

Note: schbench results at lower worker counts have a large
run-to-run variance and depend on certain characteristics
of new-idle balance.

~~~~~~
stream
~~~~~~

- 10 runs

Test:     tip (91caa5ae2424)      tip (14b3f2d9ee8d)
 Copy:   336140.45 (0.00 pct)    334362.29 (-0.52 pct)
Scale:   214679.13 (0.00 pct)    218016.44 (1.55 pct)
  Add:   251691.67 (0.00 pct)    249734.04 (-0.77 pct)
Triad:   262174.93 (0.00 pct)    260063.57 (-0.80 pct)

- 100 runs

Test:     tip (91caa5ae2424)      tip (14b3f2d9ee8d)
 Copy:   336576.38 (0.00 pct)    334646.27 (-0.57 pct)
Scale:   219124.86 (0.00 pct)    223480.29 (1.98 pct)
  Add:   251796.93 (0.00 pct)    250845.95 (-0.37 pct)
Triad:   262286.47 (0.00 pct)    258020.57 (-1.62 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

tip (91caa5ae2424):   290479.00 (var: 1.53)
tip (14b3f2d9ee8d):   287361.67 (var: 0.80) (-1.07 pct)

>
> [..snip..]
> 

We also ran tbench on a dual socket Icelake Xeon system (2 x 32C/64T,
Intel Xeon Platinum 8362) and the following are the results:

 Clients:       tip (91caa5ae2424)       tip (14b3f2d9ee8d)
    1           131.56 (0.00 pct)        145.05 (10.25 pct)
    2           279.46 (0.00 pct)        245.94 (-11.99 pct)	*
    4           552.19 (0.00 pct)        577.94 (4.66 pct)
    8           1176.12 (0.00 pct)       1309.40 (11.33 pct)
   16           2685.86 (0.00 pct)       2724.96 (1.45 pct)
   32           11582.50 (0.00 pct)      10817.42 (-6.60 pct)	*
   64           18309.47 (0.00 pct)      18287.92 (-0.11 pct)
  128           15689.97 (0.00 pct)      15322.08 (-2.34 pct)
  256           37130.34 (0.00 pct)      37332.91 (0.54 pct)
  512           36448.68 (0.00 pct)      36162.17 (-0.78 pct)

Both kernel versions show a lot of run-to-run variance, but the two highlighted
cases are relatively more stable, and the case with 32 clients is outside of
the NUMA imbalance threshold.

Below are detailed stats of the individual runs on the Icelake machine:

Kernel:          tip            tip
           (91caa5ae2424)  (14b3f2d9ee8d)

o 1 Clients

Min           : 126.23        128.01
Max           : 137.75        168.99
Median        : 131.20        143.83
AMean         : 131.73        146.94
GMean         : 131.64        145.99
HMean         : 131.56        145.05
AMean Stddev  : 5.78          20.66
AMean CoefVar : 4.39 pct      14.06 pct

o 2 clients *

Min           : 261.71        234.26
Max           : 293.38        253.03
Median        : 285.27        251.43
AMean         : 280.12        246.24
GMean         : 279.79        246.09
HMean         : 279.46        245.94
AMean Stddev  : 16.45         10.40 
AMean CoefVar : 5.87 pct      4.22 pct

o 4 clients

Min           : 515.98        497.81
Max           : 611.56        744.49
Median        : 537.71        543.82
AMean         : 555.08        595.37
GMean         : 553.61        586.31
HMean         : 552.19        577.94
AMean Stddev  : 50.10         131.17
AMean CoefVar : 9.03 pct      22.03 pct

o 8 clients

Min           : 1101.30       1178.59
Max           : 1293.18       1442.30
Median        : 1150.15       1334.56
AMean         : 1181.54       1318.48
GMean         : 1178.80       1313.97
HMean         : 1176.12       1309.40
AMean Stddev  : 99.72         132.59
AMean CoefVar : 8.44 pct      10.06 pct

o 16 clients

Min           : 2478.03       2522.28
Max           : 2829.35       3043.49
Median        : 2777.96       2660.31
AMean         : 2695.11       2742.03
GMean         : 2690.54       2733.37
HMean         : 2685.86       2724.96
AMean Stddev  : 189.75        270.04
AMean CoefVar : 7.04 pct      9.85 pct

o 32 clients *

Min           : 11479.70      10569.00
Max           : 11670.50      11154.90
Median        : 11598.90      10744.90
AMean         : 11583.03      10822.93
GMean         : 11582.77      10820.17
HMean         : 11582.50      10817.42
AMean Stddev  : 96.38         300.64
AMean CoefVar : 0.83 pct      2.78 pct

Please let me know if you would like me to gather any more data
on the test system.
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
@ 2022-08-01 13:26   ` Ingo Molnar
  2022-08-01 14:59     ` Libo Chen
  2022-08-01 14:57   ` Libo Chen
  1 sibling, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2022-08-01 13:26 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Libo Chen, peterz, vincent.guittot, mgorman, tim.c.chen, 21cnbao,
	dietmar.eggemann, linux-kernel, tglx


* K Prateek Nayak <kprateek.nayak@amd.com> wrote:

> Hello Libo and Peter,
> 
> tl;dr
> 
> - We observed a major regression with tbench when testing the latest tip
>   sched/core at:
>   commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>   Reason for the regression are the fewer affine wakeups that leaves the
>   client farther away from the data it needs to consume next primed in the
>   waker's LLC.
>   Such regressions can be expected from tasks that use sockets to communicate
>   significant amount of data especially on system with multiple LLCs.
> 
> - Other benchmarks have a comparable behavior to the tip at previous commit
>   commit : 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>   into core tree when update cookie"
> 
> I'll leave more details below.

Mel Gorman also warned about this negative side-effect in:

   Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
   Date: Fri, 15 Jul 2022 11:07:38 +0100
   Message-ID: <20220715100738.GD3493@suse.de>

   https://lore.kernel.org/all/20220715100738.GD3493@suse.de/

I've reverted this commit (14b3f2d9ee8df3b) for the time being from 
tip:sched/core.

Thanks for the heads-up!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
  2022-08-01 13:26   ` Ingo Molnar
@ 2022-08-01 14:57   ` Libo Chen
  2022-08-02  4:40     ` K Prateek Nayak
  1 sibling, 1 reply; 21+ messages in thread
From: Libo Chen @ 2022-08-01 14:57 UTC (permalink / raw)
  To: K Prateek Nayak, peterz, vincent.guittot, mgorman, tim.c.chen,
	21cnbao, dietmar.eggemann
  Cc: linux-kernel, tglx



On 7/28/22 21:47, K Prateek Nayak wrote:
> Hello Libo and Peter,
>
> tl;dr
>
> - We observed a major regression with tbench when testing the latest tip
>    sched/core at:
>    commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>    Reason for the regression are the fewer affine wakeups that leaves the
>    client farther away from the data it needs to consume next primed in the
>    waker's LLC.
>    Such regressions can be expected from tasks that use sockets to communicate
>    significant amount of data especially on system with multiple LLCs.
>
> - Other benchmarks have a comparable behavior to the tip at previous commit
>    commit : 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>    into core tree when update cookie"
>
> I'll leave more details below.
>
> On 7/12/2022 4:17 AM, Libo Chen wrote:
>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>> seems to reduce overeager wakeup pulling and shows noticeable performance
>> improvement.[1]
>>
>> This patch argues that allowing sync for wakeups from interrupt context
>> is a bug and fixing it can improve performance even when irq/softirq is
>> evenly spread out.
>>
>> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
>> waker can be any random task that happens to be running on that CPU when the
>> interrupt comes in. This is completely different from other wakups where the
>> task running on the waking CPU is the actual waker. For example, two tasks
>> communicate through a pipe or mutiple tasks access the same critical section,
>> etc. This difference is important because with sync we assume the waker will
>> get off the runqueue and go to sleep immedately after the wakeup. The
>> assumption is built into wake_affine() where it discounts the waker's presence
>> from the runqueue when sync is true. The random waker from interrupts bears no
>> relation to the wakee and don't usually go to sleep immediately afterwards
>> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
>> preepmtion of waker for sync wakeup as it used to before
>> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
>> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
>> appropriate too but at least sync wakeup will do what it's supposed to do.
>>
>> Add a check to make sure that sync can only be set when in_task() is true. If
>> a wakeup is from interrupt context, sync flag will be 0 because in_task()
>> returns 0.
>>
>> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
>> performance changes; 2) interrupt contexts, expected to see better performance
>> under low/medium load and no regression under heavy load.
>>
>> Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
>> a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
>> with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
>> daemon is active.
>>
>> Hackbench (config-scheduler-unbound)
>> =========
>> process-pipes:
>>                  Baseline                Patched
>> Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
>> Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
>> Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
>> Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
>> Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
>> Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
>> Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
>> Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
>> Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
>> Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
>> Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
>> Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
>> Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
>> Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
>> Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)
>>
>> process-sockets:
>>                  Baseline                Patched
>> Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
>> Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
>> Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
>> Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
>> Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
>> Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
>> Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
>> Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
>> Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
>> Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
>> Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
>> Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
>> Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
>> Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
>> Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)
>>
>> thread-pipes:
>>                  Baseline                Patched
>> Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
>> Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
>> Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
>> Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
>> Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
>> Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
>> Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
>> Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
>> Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
>> Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
>> Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
>> Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
>> Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
>> Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
>> Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)
>>
>> thread-sockets:
>>                  Baseline                Patched
>> Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
>> Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
>> Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
>> Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
>> Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
>> Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
>> Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
>> Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
>> Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
>> Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
>> Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
>> Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
>> Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
>> Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
>> Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)
>>
>> Pgbench (config-db-pgbench-timed-ro-small)
>> =======
>>                  Baseline            Patched
>> Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
>> Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
>> Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
>> Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
>> Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
>> Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
>> Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
>> Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
>> Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
>> Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)
> The two tests kernels used are:
>
> - tip at commit: 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
> - tip at commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue into core tree when update cookie"
>
> Below are the tbench result on a dual socket Zen3 machine
> running in NPS1 mode. Following is the NUMA configuration
> NPS1 mode:
>
> - NPS1: Each socket is a NUMA node.
>    Total 2 NUMA nodes in the dual socket machine.
>
>    Node 0: 0-63,   128-191
>    Node 1: 64-127, 192-255
>
> Clients: tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>      1    569.24 (0.00 pct)       283.63 (-50.17 pct)    *
>      2    1104.76 (0.00 pct)      590.45 (-46.55 pct)    *
>      4    2058.78 (0.00 pct)      1080.63 (-47.51 pct)   *
>      8    3590.20 (0.00 pct)      2098.05 (-41.56 pct)   *
>     16    6119.21 (0.00 pct)      4348.40 (-28.93 pct)   *
>     32    11383.91 (0.00 pct)     8417.55 (-26.05 pct)   *
>     64    21910.01 (0.00 pct)     19405.11 (-11.43 pct)  *
>    128    33105.27 (0.00 pct)     29791.80 (-10.00 pct)  *
>    256    45550.99 (0.00 pct)     45847.10 (0.65 pct)
>    512    57753.81 (0.00 pct)     49481.17 (-14.32 pct)  *
>   1024    55684.33 (0.00 pct)     48748.38 (-12.45 pct)  *
>
> This regression is consistently reproducible.
I ran tbench with 1 client on my 8-node Zen2 machine because 1~4 client counts
generally shouldn't be affected by this patch. I do see throughput regress with
the patch, but the latency improves pretty much equally. Furthermore, I also
don't see tbench tasks being separated into different LLC domains in my ftrace;
they are almost always in the same CCXes. What I do see is a lot fewer
interrupts and context switches, and the average CPU frequency is lower too
with the patch. It is bizarre that Intel doesn't seem to be impacted.
Trying to understand why right now.
> Below are the statistics gathered from schedstat data:
>
> Kernel                                                     :        tip + remove 14b3f2d9ee8d                    tip
> HEAD commit                                                :             91caa5ae2424                       14b3f2d9ee8d
> sched_yield cnt                                            :                   11                                 17
> Legacy counter can be ignored                              :                    0                                  0
> schedule called                                            :             12621212                           15104022
> schedule left the processor idle                           :              6306653 ( 49.96% of times )        7547375       ( 49.96% of times )
> try_to_wake_up was called                                  :              6310778                            7552147
> try_to_wake_up was called to wake up the local cpu         :                12305 ( 0.19% of times )           12816       ( 0.16% of times )
> total time by tasks on this processor (in jiffies)         :          78497712520                        72461915902
> total time waiting tasks on this processor (in jiffies)    :             56398803 ( 0.07% of times )        34631178       ( 0.04% of times )
> total timeslices run on this cpu                           :              6314548                            7556630
>
> Wakeups on same                                    SMT     :                   39 ( 0.00062 )                    263 ( 0.00348 )
> Wakeups on same                                    MC      :              6297015 ( 99.78% of time ) <---       1079 ( 0.01429 )
> Wakeups on same                                    DIE     :                  559 ( 0.00886 )                7537909 ( 99.81147 ) <--- With the patch, the task will prefer
I don't have a zen3 right now. What is the span of your MC domain as 
well as DIE?

Thanks for the testing.

Libo
> Wakeups on same                                    NUMA    :                  860 ( 0.01363 )                     80 ( 0.00106 )       to wake on the same LLC where it previously
> Affine wakeups on same                             SMT     :                   25 ( 0.00040 )                    255 ( 0.00338 )       ran as compared to the LLC of waker.
> Affine wakeups on same                             MC      :              6282684 ( 99.55% of time ) <---        961 ( 0.01272 )       This results in performance degradation as
> Affine wakeups on same                             DIE     :                  523 ( 0.00829 )                7537220 ( 99.80235 ) <--- the task is farther away from data it will
> Affine wakeups on same                             NUMA    :                  839 ( 0.01329 )                     46 ( 0.00061 )       consume next.
>
> All the statistics are comparable except for the reduced number of affine
> wakeup on the waker's LLC that resulting in task being placed on the previous
> LLC farther away from the data that resides in the waker's LLC that the wakee
> will consume next.
>
> All wakeups in the tbench, happens in_serving_softirq() making in_taks() false
> for all the cases where sync would have been true otherwise.
>
> We wanted to highlight there are workloads which would still benefit from
> affine wakeups even when it happens in an interrupt context. It would be
> great if we could spot such cases and bias wakeups towards waker's LLC.
>
> Other benchmarks results are comparable to the tip in most cases.
> All benchmarks were run on machine configured in NPS1 mode.
> Following are the results:
>
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
>
> Test:             tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>   1-groups:         4.22 (0.00 pct)         4.48 (-6.16 pct)     *
>   1-groups:         4.22 (0.00 pct)         4.30 (-1.89 pct)     [Verification run]
>   2-groups:         5.01 (0.00 pct)         4.87 (2.79 pct)
>   4-groups:         5.49 (0.00 pct)         5.34 (2.73 pct)
>   8-groups:         5.64 (0.00 pct)         5.50 (2.48 pct)
> 16-groups:         7.54 (0.00 pct)         7.34 (2.65 pct)
>
> ~~~~~~~~
> schbench
> ~~~~~~~~
>
> #workers: tip (91caa5ae2424)     tip (14b3f2d9ee8d)
>    1:      22.00 (0.00 pct)        22.00 (0.00 pct)
>    2:      22.00 (0.00 pct)        27.00 (-22.72 pct)    * Known to have run to run
>    4:      33.00 (0.00 pct)        38.00 (-15.15 pct)    * variations.
>    8:      48.00 (0.00 pct)        51.00 (-6.25 pct)     *
>   16:      70.00 (0.00 pct)        70.00 (0.00 pct)
>   32:     118.00 (0.00 pct)       120.00 (-1.69 pct)
>   64:     217.00 (0.00 pct)       223.00 (-2.76 pct)
> 128:     485.00 (0.00 pct)       488.00 (-0.61 pct)
> 256:     1066.00 (0.00 pct)      1054.00 (1.12 pct)
> 512:     47296.00 (0.00 pct)     47168.00 (0.27 pct)
>
> Note: schbench results at lower worker count have a large
> run-to-run variance and depends on certain characteristics
> of new-idle balance.
>
> ~~~~~~
> stream
> ~~~~~~
>
> - 10 runs
>
> Test:     tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>   Copy:   336140.45 (0.00 pct)    334362.29 (-0.52 pct)
> Scale:   214679.13 (0.00 pct)    218016.44 (1.55 pct)
>    Add:   251691.67 (0.00 pct)    249734.04 (-0.77 pct)
> Triad:   262174.93 (0.00 pct)    260063.57 (-0.80 pct)
>
> - 100 runs
>
> Test:     tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>   Copy:   336576.38 (0.00 pct)    334646.27 (-0.57 pct)
> Scale:   219124.86 (0.00 pct)    223480.29 (1.98 pct)
>    Add:   251796.93 (0.00 pct)    250845.95 (-0.37 pct)
> Triad:   262286.47 (0.00 pct)    258020.57 (-1.62 pct)
>
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
>
> tip (91caa5ae2424):   290479.00 (var: 1.53)
> tip (14b3f2d9ee8d):   287361.67 (var: 0.80) (-1.07 pct)
>
>> [..snip..]
>>
> We also ran tbench on a dual socket Icelake Xeon system (2 x 32C/64T)
> (Intel Xeon Platinum 8362) and following are the results:
>
>   Clients:       tip (91caa5ae2424)       tip (14b3f2d9ee8d)
>      1           131.56 (0.00 pct)        145.05 (10.25 pct)
>      2           279.46 (0.00 pct)        245.94 (-11.99 pct)	*
>      4           552.19 (0.00 pct)        577.94 (4.66 pct)
>      8           1176.12 (0.00 pct)       1309.40 (11.33 pct)
>     16           2685.86 (0.00 pct)       2724.96 (1.45 pct)
>     32           11582.50 (0.00 pct)      10817.42 (-6.60 pct)	*
>     64           18309.47 (0.00 pct)      18287.92 (-0.11 pct)
>    128           15689.97 (0.00 pct)      15322.08 (-2.34 pct)
>    256           37130.34 (0.00 pct)      37332.91 (0.54 pct)
>    512           36448.68 (0.00 pct)      36162.17 (-0.78 pct)
>
> Both kernel versions show lot of run to run variance but the two highlighted
> cases are relatively more stable and the case with 32 clients is outside of
> the NUMA imbalance threshold.
>
> Below are detailed stats of each individual runs on Icelacke machine:
>
> Kernel:          tip            tip
>             (91caa5ae2424)  (14b3f2d9ee8d)
>
> o 1 Clients
>
> Min           : 126.23        128.01
> Max           : 137.75        168.99
> Median        : 131.20        143.83
> AMean         : 131.73        146.94
> GMean         : 131.64        145.99
> HMean         : 131.56        145.05
> AMean Stddev  : 5.78          20.66
> AMean CoefVar : 4.39 pct      14.06 pct
>
> o 2 clients *
>
> Min           : 261.71        234.26
> Max           : 293.38        253.03
> Median        : 285.27        251.43
> AMean         : 280.12        246.24
> GMean         : 279.79        246.09
> HMean         : 279.46        245.94
> AMean Stddev  : 16.45         10.40
> AMean CoefVar : 5.87 pct      4.22 pct
>
> o 4 clients
>
> Min           : 515.98        497.81
> Max           : 611.56        744.49
> Median        : 537.71        543.82
> AMean         : 555.08        595.37
> GMean         : 553.61        586.31
> HMean         : 552.19        577.94
> AMean Stddev  : 50.10         131.17
> AMean CoefVar : 9.03 pct      22.03 pct
>
> o 8 clients
>
> Min           : 1101.30       1178.59
> Max           : 1293.18       1442.30
> Median        : 1150.15       1334.56
> AMean         : 1181.54       1318.48
> GMean         : 1178.80       1313.97
> HMean         : 1176.12       1309.40
> AMean Stddev  : 99.72         132.59
> AMean CoefVar : 8.44 pct      10.06 pct
>
> o 16 clients
>
> Min           : 2478.03       2522.28
> Max           : 2829.35       3043.49
> Median        : 2777.96       2660.31
> AMean         : 2695.11       2742.03
> GMean         : 2690.54       2733.37
> HMean         : 2685.86       2724.96
> AMean Stddev  : 189.75        270.04
> AMean CoefVar : 7.04 pct      9.85 pct
>
> o 32 clients *
>
> Min           : 11479.70      10569.00
> Max           : 11670.50      11154.90
> Median        : 11598.90      10744.90
> AMean         : 11583.03      10822.93
> GMean         : 11582.77      10820.17
> HMean         : 11582.50      10817.42
> AMean Stddev  : 96.38         300.64
> AMean CoefVar : 0.83 pct      2.78 pct
>
> Please let me know if you would like me to gather any more data
> on the test system.
> --
> Thanks and Regards,
> Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-08-01 13:26   ` Ingo Molnar
@ 2022-08-01 14:59     ` Libo Chen
  2022-08-03  9:18       ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Libo Chen @ 2022-08-01 14:59 UTC (permalink / raw)
  To: Ingo Molnar, K Prateek Nayak
  Cc: peterz, vincent.guittot, mgorman, tim.c.chen, 21cnbao,
	dietmar.eggemann, linux-kernel, tglx



On 8/1/22 06:26, Ingo Molnar wrote:
> * K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
>> Hello Libo and Peter,
>>
>> tl;dr
>>
>> - We observed a major regression with tbench when testing the latest tip
>>    sched/core at:
>>    commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>>    Reason for the regression are the fewer affine wakeups that leaves the
>>    client farther away from the data it needs to consume next primed in the
>>    waker's LLC.
>>    Such regressions can be expected from tasks that use sockets to communicate
>>    significant amount of data especially on system with multiple LLCs.
>>
>> - Other benchmarks have a comparable behavior to the tip at previous commit
>>    commit : 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>>    into core tree when update cookie"
>>
>> I'll leave more details below.
> Mel Gorman also warned about this negative side-effect in:
>
>     Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
>     Date: Fri, 15 Jul 2022 11:07:38 +0100
>     Message-ID: <20220715100738.GD3493@suse.de>
>
>     https://lore.kernel.org/all/20220715100738.GD3493@suse.de/
?? Mel was talking about a completely different thing: I brought up a
different patch that I wanted to revert, and Mel thought it would hurt
other workloads which don't benefit from pulling. But as you can see,
tbench somehow benefits from it, at least according to one metric from
one workload.

Libo
> I've reverted this commit (14b3f2d9ee8df3b) for the time being from
> tip:sched/core.
>
> Thanks for the heads-up!
>
> Thanks,
>
> 	Ingo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-08-01 14:57   ` Libo Chen
@ 2022-08-02  4:40     ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2022-08-02  4:40 UTC (permalink / raw)
  To: Libo Chen, peterz, vincent.guittot, mgorman, tim.c.chen, 21cnbao,
	dietmar.eggemann
  Cc: linux-kernel, tglx

Hello Libo,

Thank you for looking into this.

On 8/1/2022 8:27 PM, Libo Chen wrote:
> 
> 
> On 7/28/22 21:47, K Prateek Nayak wrote:
>> Hello Libo and Peter,
>>
>> tl;dr
>>
>> - We observed a major regression with tbench when testing the latest tip
>>    sched/core at:
>>    commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>>    The reason for the regression is the fewer affine wakeups, which leave
>>    the client farther away from the data it needs to consume next, primed
>>    in the waker's LLC.
>>    Such regressions can be expected from tasks that use sockets to
>>    communicate significant amounts of data, especially on systems with
>>    multiple LLCs.
>>
>> - Other benchmarks have a comparable behavior to the tip at the previous
>>    commit 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>>    into core tree when update cookie"
>>
>> I'll leave more details below.
>>
>> On 7/12/2022 4:17 AM, Libo Chen wrote:
>>> [..snip..]
>>
>> The two tests kernels used are:
>>
>> - tip at commit: 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>> - tip at commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue into core tree when update cookie"
>>
>> Below are the tbench result on a dual socket Zen3 machine
>> running in NPS1 mode. Following is the NUMA configuration
>> NPS1 mode:
>>
>> - NPS1: Each socket is a NUMA node.
>>    Total 2 NUMA nodes in the dual socket machine.
>>
>>    Node 0: 0-63,   128-191
>>    Node 1: 64-127, 192-255
>>
>> Clients: tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>>      1    569.24 (0.00 pct)       283.63 (-50.17 pct)    *
>>      2    1104.76 (0.00 pct)      590.45 (-46.55 pct)    *
>>      4    2058.78 (0.00 pct)      1080.63 (-47.51 pct)   *
>>      8    3590.20 (0.00 pct)      2098.05 (-41.56 pct)   *
>>     16    6119.21 (0.00 pct)      4348.40 (-28.93 pct)   *
>>     32    11383.91 (0.00 pct)     8417.55 (-26.05 pct)   *
>>     64    21910.01 (0.00 pct)     19405.11 (-11.43 pct)  *
>>    128    33105.27 (0.00 pct)     29791.80 (-10.00 pct)  *
>>    256    45550.99 (0.00 pct)     45847.10 (0.65 pct)
>>    512    57753.81 (0.00 pct)     49481.17 (-14.32 pct)  *
>>   1024    55684.33 (0.00 pct)     48748.38 (-12.45 pct)  *
>>
>> This regression is consistently reproducible.
> I ran tbench with 1 client on my 8-node Zen2 machine because client counts of 1-4 generally shouldn't be affected by this patch. I do see the throughput regress with the patch, but
> the latency improves pretty much equally. Furthermore, from my ftrace I don't see tbench tasks being separated into different LLC domains; they are almost always in the same CCXes.
> What I do see is a lot fewer interrupts and context switches, and the average CPU frequency is lower too with the patch. It is bizarre that Intel doesn't seem to be impacted.
> Trying to understand why right now.

Thank you for analyzing this. I see a drop in max latency with the patch but the
average latency has increased on the patched kernel. Following are the logs from
one of the runs for the 1 client case:

- tip (91caa5ae2424)

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Deltree                     28     0.000     0.001
 Flush                    76361     0.008     0.018
 Close                   800338     0.008     0.080
 LockX                     3546     0.008     0.015
 Mkdir                       14     0.008     0.009
 Rename                   46131     0.008     0.050
 ReadX                  1707761     0.009     0.139
 WriteX                  543274     0.012     0.092
 Unlink                  220019     0.008     0.083
 UnlockX                   3546     0.008     0.016
 FIND_FIRST              381795     0.008     0.079
 SET_FILE_INFORMATION     88740     0.008     0.080
 QUERY_FILE_INFORMATION  173062     0.008     0.061
 QUERY_PATH_INFORMATION  987524     0.008     0.070
 QUERY_FS_INFORMATION    181068     0.008     0.049
 NTCreateX              1089543     0.008     0.083

Throughput 570.36 MB/sec  1 clients  1 procs  max_latency=0.140 ms

- tip (14b3f2d9ee8d)

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Deltree                     14     0.000     0.001
 Flush                    38993     0.017     0.059
 Close                   408696     0.017     0.085
 LockX                     1810     0.017     0.023
 Mkdir                        7     0.016     0.017
 Rename                   23555     0.017     0.052
 ReadX                   871996     0.018     0.097
 WriteX                  277343     0.025     0.105
 Unlink                  112357     0.017     0.055
 UnlockX                   1810     0.017     0.023
 FIND_FIRST              194961     0.017     0.089
 SET_FILE_INFORMATION     45312     0.017     0.032
 QUERY_FILE_INFORMATION   88356     0.017     0.078
 QUERY_PATH_INFORMATION  504289     0.017     0.119
 QUERY_FS_INFORMATION     92460     0.017     0.085
 NTCreateX               556374     0.017     0.097

Throughput 291.163 MB/sec  1 clients  1 procs  max_latency=0.119 ms

I had only analyzed the schedstat data which showed a clear shift
in the number of affine wakeups. I haven't measured the average CPU
frequency during the runs. The numbers reported are with the performance
governor. I'll try to get more data on the CPU frequency front.

>> Below are the statistics gathered from schedstat data:
>>
>> Kernel                                                     :        tip + remove 14b3f2d9ee8d                    tip
>> HEAD commit                                                :             91caa5ae2424                       14b3f2d9ee8d
>> sched_yield cnt                                            :                   11                                 17
>> Legacy counter can be ignored                              :                    0                                  0
>> schedule called                                            :             12621212                           15104022
>> schedule left the processor idle                           :              6306653 ( 49.96% of times )        7547375       ( 49.96% of times )
>> try_to_wake_up was called                                  :              6310778                            7552147
>> try_to_wake_up was called to wake up the local cpu         :                12305 ( 0.19% of times )           12816       ( 0.16% of times )
>> total time by tasks on this processor (in jiffies)         :          78497712520                        72461915902
>> total time waiting tasks on this processor (in jiffies)    :             56398803 ( 0.07% of times )        34631178       ( 0.04% of times )
>> total timeslices run on this cpu                           :              6314548                            7556630
>>
>> Wakeups on same                                    SMT     :                   39 ( 0.00062 )                    263 ( 0.00348 )
>> Wakeups on same                                    MC      :              6297015 ( 99.78% of time ) <---       1079 ( 0.01429 )
>> Wakeups on same                                    DIE     :                  559 ( 0.00886 )                7537909 ( 99.81147 ) <--- With the patch, the task will prefer
> I don't have a Zen3 right now. What are the spans of your MC and DIE domains?

On Zen3, a group in the MC domain consists of the 16 CPUs on the same CCD.
On a dual-socket Zen3 system (2 x 64C/128T) running in NPS1 mode,
the DIE domain consists of all the CPUs on the same socket. There are two
DIE groups in the dual-socket test system. Following are the spans of each,
with a small placement-check sketch after the listing:

- DIE0: 0-63,128-191

    DIE 0 MC 0: 0-7,128-135
    DIE 0 MC 1: 8-15,136-143
    DIE 0 MC 2: 16-23,144-151
    DIE 0 MC 3: 24-31,152-159
    DIE 0 MC 4: 32-39,160-167
    DIE 0 MC 5: 40-47,168-175
    DIE 0 MC 6: 48-55,176-183
    DIE 0 MC 7: 56-63,184-191

- DIE1: 64-127,192-255

    DIE 1 MC 0: 64-71,192-199
    DIE 1 MC 1: 72-79,200-207
    DIE 1 MC 2: 80-87,208-215
    DIE 1 MC 3: 88-95,216-223
    DIE 1 MC 4: 96-103,224-231
    DIE 1 MC 5: 104-111,232-239
    DIE 1 MC 6: 112-119,240-247
    DIE 1 MC 7: 120-127,248-255
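
As a side note, a minimal sketch of how the same-LLC question can be asked
inside the kernel is below; it is illustrative only and not part of the patch
under discussion. cpus_share_cache() is the existing scheduler helper backing
the MC-level check, while the wrapper function and the pr_debug() call site
are made up:

static void note_cross_llc_wakeup(int waking_cpu, int prev_cpu)
{
        /* Flag wakeups whose waking CPU and previous CPU do not share a
         * last-level cache, i.e. fall outside the same MC span above. */
        if (!cpus_share_cache(waking_cpu, prev_cpu))
                pr_debug("wakeup crosses LLC: waking CPU%d, prev CPU%d\n",
                         waking_cpu, prev_cpu);
}
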
> 
> Thanks for the testing.
> 
> Libo
>> Wakeups on same                                    NUMA    :                  860 ( 0.01363 )                     80 ( 0.00106 )       to wake on the same LLC where it previously
>> Affine wakeups on same                             SMT     :                   25 ( 0.00040 )                    255 ( 0.00338 )       ran as compared to the LLC of waker.
>> Affine wakeups on same                             MC      :              6282684 ( 99.55% of time ) <---        961 ( 0.01272 )       This results in performance degradation as
>> Affine wakeups on same                             DIE     :                  523 ( 0.00829 )                7537220 ( 99.80235 ) <--- the task is farther away from data it will
>> Affine wakeups on same                             NUMA    :                  839 ( 0.01329 )                     46 ( 0.00061 )       consume next.
>>
>> All the statistics are comparable except for the reduced number of affine
>> wakeups on the waker's LLC, which results in the task being placed on its
>> previous LLC, farther away from the data residing in the waker's LLC that
>> the wakee will consume next.
>>
>> All wakeups in tbench happen in in_serving_softirq() context, making
>> in_task() false for all the cases where sync would otherwise have been true.
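
To make that concrete, here is a minimal sketch of the kind of gate being
discussed. It is illustrative only and not the exact hunk from commit
14b3f2d9ee8d: WF_SYNC, in_task() and PF_EXITING are existing kernel symbols,
while the wrapper function is made up for this sketch:

static inline bool sync_wakeup_allowed(int wake_flags)
{
        /*
         * Honour the sync hint only for wakeups issued from task context.
         * in_task() is false in hardirq/softirq context, where "current"
         * is merely whichever task was running when the interrupt arrived,
         * so treating it as a waker that is about to sleep makes no sense.
         */
        return in_task() && (wake_flags & WF_SYNC) &&
               !(current->flags & PF_EXITING);
}

When this evaluates to false, wake_affine() no longer discounts the waking
CPU's load, so the wakee is more likely to stay on its previous LLC instead
of being pulled next to the CPU that handled the interrupt, which matches
the shift visible in the schedstat numbers above.
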
>>
>> We wanted to highlight that there are workloads which would still benefit
>> from affine wakeups even when they happen in an interrupt context. It would
>> be great if we could spot such cases and bias wakeups towards the waker's LLC.
>>
>> Other benchmark results are comparable to the tip in most cases.
>> All benchmarks were run on a machine configured in NPS1 mode.
>> Following are the results:
>>
>> ~~~~~~~~~
>> hackbench
>> ~~~~~~~~~
>>
>> Test:             tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>>   1-groups:         4.22 (0.00 pct)         4.48 (-6.16 pct)     *
>>   1-groups:         4.22 (0.00 pct)         4.30 (-1.89 pct)     [Verification run]
>>   2-groups:         5.01 (0.00 pct)         4.87 (2.79 pct)
>>   4-groups:         5.49 (0.00 pct)         5.34 (2.73 pct)
>>   8-groups:         5.64 (0.00 pct)         5.50 (2.48 pct)
>> 16-groups:         7.54 (0.00 pct)         7.34 (2.65 pct)
>>
>> ~~~~~~~~
>> schbench
>> ~~~~~~~~
>>
>> #workers: tip (91caa5ae2424)     tip (14b3f2d9ee8d)
>>    1:      22.00 (0.00 pct)        22.00 (0.00 pct)
>>    2:      22.00 (0.00 pct)        27.00 (-22.72 pct)    * Known to have run to run
>>    4:      33.00 (0.00 pct)        38.00 (-15.15 pct)    * variations.
>>    8:      48.00 (0.00 pct)        51.00 (-6.25 pct)     *
>>   16:      70.00 (0.00 pct)        70.00 (0.00 pct)
>>   32:     118.00 (0.00 pct)       120.00 (-1.69 pct)
>>   64:     217.00 (0.00 pct)       223.00 (-2.76 pct)
>> 128:     485.00 (0.00 pct)       488.00 (-0.61 pct)
>> 256:     1066.00 (0.00 pct)      1054.00 (1.12 pct)
>> 512:     47296.00 (0.00 pct)     47168.00 (0.27 pct)
>>
>> Note: schbench results at lower worker count have a large
>> run-to-run variance and depends on certain characteristics
>> of new-idle balance.
>>
>> ~~~~~~
>> stream
>> ~~~~~~
>>
>> - 10 runs
>>
>> Test:     tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>>   Copy:   336140.45 (0.00 pct)    334362.29 (-0.52 pct)
>> Scale:   214679.13 (0.00 pct)    218016.44 (1.55 pct)
>>    Add:   251691.67 (0.00 pct)    249734.04 (-0.77 pct)
>> Triad:   262174.93 (0.00 pct)    260063.57 (-0.80 pct)
>>
>> - 100 runs
>>
>> Test:     tip (91caa5ae2424)      tip (14b3f2d9ee8d)
>>   Copy:   336576.38 (0.00 pct)    334646.27 (-0.57 pct)
>> Scale:   219124.86 (0.00 pct)    223480.29 (1.98 pct)
>>    Add:   251796.93 (0.00 pct)    250845.95 (-0.37 pct)
>> Triad:   262286.47 (0.00 pct)    258020.57 (-1.62 pct)
>>
>> ~~~~~~~~~~~~
>> ycsb-mongodb
>> ~~~~~~~~~~~~
>>
>> tip (91caa5ae2424):   290479.00 (var: 1.53)
>> tip (14b3f2d9ee8d):   287361.67 (var: 0.80) (-1.07 pct)
>>
>>> [..snip..]
> 

Thank you again for looking into this issue and for sharing the
observations on the Zen2 system.
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-08-01 14:59     ` Libo Chen
@ 2022-08-03  9:18       ` Ingo Molnar
  2022-08-03 19:37         ` Libo Chen
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2022-08-03  9:18 UTC (permalink / raw)
  To: Libo Chen
  Cc: K Prateek Nayak, peterz, vincent.guittot, mgorman, tim.c.chen,
	21cnbao, dietmar.eggemann, linux-kernel, tglx


* Libo Chen <libo.chen@oracle.com> wrote:

> 
> 
> On 8/1/22 06:26, Ingo Molnar wrote:
> > * K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > 
> > > Hello Libo and Peter,
> > > 
> > > tl;dr
> > > 
> > > - We observed a major regression with tbench when testing the latest tip
> > >    sched/core at:
> > >    commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
> > >    The reason for the regression is the fewer affine wakeups, which leave
> > >    the client farther away from the data it needs to consume next, primed
> > >    in the waker's LLC.
> > >    Such regressions can be expected from tasks that use sockets to
> > >    communicate significant amounts of data, especially on systems with
> > >    multiple LLCs.
> > > 
> > > - Other benchmarks have a comparable behavior to the tip at the previous
> > >    commit 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
> > >    into core tree when update cookie"
> > > 
> > > I'll leave more details below.
> > Mel Gorman also warned about this negative side-effect in:
> > 
> >     Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
> >     Date: Fri, 15 Jul 2022 11:07:38 +0100
> >     Message-ID: <20220715100738.GD3493@suse.de>
> > 
> >     https://lore.kernel.org/all/20220715100738.GD3493@suse.de/
> ?? Mel was talking about a completely different thing: I brought up a
> different patch that I wanted to revert, and Mel thought it would hurt other
> workloads which don't benefit from pulling. But as you can see, tbench
> somehow benefits from it, at least according to one metric from one workload.

Yeah - but nevertheless the discussion with Mel was open-ended AFAICS, and 
the 'major tbench regression' report by K Prateek Nayak above still stands 
and needs to be investigated/understood, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-08-03  9:18       ` Ingo Molnar
@ 2022-08-03 19:37         ` Libo Chen
  2022-08-04  8:55           ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Libo Chen @ 2022-08-03 19:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: K Prateek Nayak, peterz, vincent.guittot, mgorman, tim.c.chen,
	21cnbao, dietmar.eggemann, linux-kernel, tglx



On 8/3/22 02:18, Ingo Molnar wrote:
> * Libo Chen <libo.chen@oracle.com> wrote:
>
>>
>> On 8/1/22 06:26, Ingo Molnar wrote:
>>> * K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>>
>>>> Hello Libo and Peter,
>>>>
>>>> tl;dr
>>>>
>>>> - We observed a major regression with tbench when testing the latest tip
>>>>     sched/core at:
>>>>     commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>>>>     The reason for the regression is the fewer affine wakeups, which leave
>>>>     the client farther away from the data it needs to consume next, primed
>>>>     in the waker's LLC.
>>>>     Such regressions can be expected from tasks that use sockets to
>>>>     communicate significant amounts of data, especially on systems with
>>>>     multiple LLCs.
>>>>
>>>> - Other benchmarks have a comparable behavior to the tip at the previous
>>>>     commit 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>>>>     into core tree when update cookie"
>>>>
>>>> I'll leave more details below.
>>> Mel Gorman also warned about this negative side-effect in:
>>>
>>>      Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
>>>      Date: Fri, 15 Jul 2022 11:07:38 +0100
>>>      Message-ID: <20220715100738.GD3493@suse.de>
>>>
>>>      https://lore.kernel.org/all/20220715100738.GD3493@suse.de/
>> ?? Mel was talking about a completely different thing: I brought up a
>> different patch that I wanted to revert, and Mel thought it would hurt other
>> workloads which don't benefit from pulling. But as you can see, tbench
>> somehow benefits from it, at least according to one metric from one workload.
> Yeah - but nevertheless the discussion with Mel was open-ended AFAICS, and
> the 'major tbench regression' report by K Prateek Nayak above still stands
> and needs to be investigated/understood, right?
Oh yes, I have no issue with holding the patch back until the regression 
is fully understood. I was just a little confused on your reference to 
Mel's comments. Anyway, I will post my investigation soon.


Libo
> Thanks,
>
> 	Ingo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-08-03 19:37         ` Libo Chen
@ 2022-08-04  8:55           ` Ingo Molnar
  2022-08-04  9:51             ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2022-08-04  8:55 UTC (permalink / raw)
  To: Libo Chen
  Cc: K Prateek Nayak, peterz, vincent.guittot, mgorman, tim.c.chen,
	21cnbao, dietmar.eggemann, linux-kernel, tglx


* Libo Chen <libo.chen@oracle.com> wrote:

> Oh yes, I have no issue with holding the patch back until the regression 
> is fully understood. I was just a little confused on your reference to 
> Mel's comments. [...]

Yeah, that was just me getting confused about which change Mel was 
referring to, as I was looking for external confirmation of what I was 
thinking about the patch: in_task()/in_interrupt() heuristics rarely do 
well. ;-)
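
(For reference, a rough sketch of why the classification is only a heuristic;
the context macros are the real ones from <linux/preempt.h>, the wrapper
function is made up:)

static bool wakeup_from_task_context(void)
{
        /*
         * in_task() is simply "not NMI, not hardirq, not serving softirq".
         * As a stand-in for "a real waker task issued this wakeup" it is
         * fuzzy in both directions:
         *  - a threaded IRQ handler reports in_task() == true even though
         *    the wakeup is effectively interrupt driven;
         *  - softirq work deferred to ksoftirqd reports in_task() == false
         *    even though it runs in an ordinary kernel thread.
         */
        return in_task();
}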

> Anyway, I will post my investigation soon.

Thx - and measurements will always be able to override any negative 
expectations of mine, so my comments weren't a NAK.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  2022-08-04  8:55           ` Ingo Molnar
@ 2022-08-04  9:51             ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-08-04  9:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Libo Chen, K Prateek Nayak, peterz, vincent.guittot, tim.c.chen,
	21cnbao, dietmar.eggemann, linux-kernel, tglx

On Thu, Aug 04, 2022 at 10:55:12AM +0200, Ingo Molnar wrote:
> 
> * Libo Chen <libo.chen@oracle.com> wrote:
> 
> > Oh yes, I have no issue with holding the patch back until the regression 
> > is fully understood. I was just a little confused on your reference to 
> > Mel's comments. [...]
> 
> Yeah, that was just me getting confused about which change Mel was 
> referring to, as I was looking for external confirmation saying what I was 
> thinking about the patch: in_task()/in_interrupt() heuristics rarely do 
> well. ;-)
> 

Even though I was referring to something else, the reported regression
is still a regression.  Prateek's report and what I brought up in
https://lore.kernel.org/all/20220715100738.GD3493@suse.de/ are both simply
examples where changes to affine wakeup can have surprising results.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-08-04  9:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
2022-07-13 16:40 ` Tim Chen
     [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
2022-07-13 19:34     ` Libo Chen
2022-07-13 20:51     ` Tim Chen
2022-07-13 21:37       ` Libo Chen
2022-07-14 14:18     ` Mel Gorman
2022-07-14 20:21       ` Libo Chen
2022-07-15 10:07         ` Mel Gorman
2022-07-14 11:26 ` Peter Zijlstra
2022-07-14 11:35 ` Peter Zijlstra
2022-07-14 18:18   ` Libo Chen
2022-07-21  8:44 ` [tip: sched/core] sched/fair: Disallow " tip-bot2 for Libo Chen
2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
2022-08-01 13:26   ` Ingo Molnar
2022-08-01 14:59     ` Libo Chen
2022-08-03  9:18       ` Ingo Molnar
2022-08-03 19:37         ` Libo Chen
2022-08-04  8:55           ` Ingo Molnar
2022-08-04  9:51             ` Mel Gorman
2022-08-01 14:57   ` Libo Chen
2022-08-02  4:40     ` K Prateek Nayak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).