netdev.vger.kernel.org archive mirror
* EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
@ 2023-11-16 18:58 Tobias Huschle
  2023-11-17  9:23 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-11-16 18:58 UTC (permalink / raw)
  To: Linux Kernel, kvm, virtualization, netdev; +Cc: Peterz, mst, jasowang

Hi,

when testing the EEVDF scheduler we stumbled upon a performance
regression in a uperf scenario and would kindly like to ask for
feedback on whether our analysis so far is going in the right
direction.

The base scenario is two KVM guests running on an s390 LPAR. One guest
hosts the uperf server, the other the uperf client.
With EEVDF we observe a regression of ~50% for a strburst test.
For a more detailed description of the setup see the section TEST
SUMMARY at the bottom.

Bisecting led us to the following commit which appears to introduce the 
regression:
86bfbb7ce4f6 sched/fair: Add lag based placement

We then compared the last good commit we identified with a recent
level of the devel branch.
The issue still persists on 6.7 rc1, although there is some
improvement (down from a 62% regression to 49%).

All analyses described below are based on a 6.6 rc7 kernel.

We sampled perf data to get an idea of what is going wrong and ended
up seeing a dramatic increase in the maximum wait times, from 3 ms up
to 366 ms. See section WAIT DELAYS below for more details.

We then collected tracing data to get better insight into what is
going on.
The trace excerpt in section TRACE EXCERPT shows one example (of
multiple per test run) of the problematic scenario, where a kworker
(pid=6525) has to wait for 39.718 ms.

Short summary:
The mentioned kworker has been scheduled to CPU 14 before the tracing 
was enabled.
A vhost process is migrated onto CPU 14.
The vruntimes of kworker and vhost differ significantly (86642125805 vs 
4242563284 -> factor 20)
The vhost process wants to wake up the kworker; therefore the kworker
is placed onto the runqueue again and set to runnable.
The vhost process continues to execute, waking up other vhost processes 
on other CPUs.

So far this behavior is no different from what we see on pre-EEVDF
kernels.

At timestamp 576.162767, the vhost process triggers the last wake-up
of another vhost on another CPU.
Until timestamp 576.171155, we see no other activity. Now the vhost
process ends its time slice.
Then vhost is assigned a new time slice four times and is finally
migrated off to CPU 15.
This does not occur with older kernels.
The kworker has to wait for the migration to happen in order to be
able to execute again.
This is due to the fact that the vruntime of the kworker is
significantly larger than that of vhost.


We observed the same magnitude of difference in vruntime between
kworker and vhost on a kernel built from the parent of the commit
mentioned above.
With EEVDF, the kworker is doomed to wait until the vhost either
catches up on vruntime (which would take 86 seconds) or is migrated
off of the CPU.

We found some options which sound plausible, but we are not sure
whether they are valid:

1. The wake-up path has a dependency on the vruntime metrics that now
   delays the execution of the kworker.
2. The previous commit af4cf40470c2 (sched/fair: Add
   cfs_rq::avg_vruntime), which updates the way cfs_rq->min_vruntime
   and cfs_rq->avg_vruntime are set, might have introduced an issue
   which is uncovered by the commit mentioned above.
3. An assumption in the vhost code causes vhost to rely on being
   scheduled off in time to allow the kworker to proceed.

We also stumbled upon the following mailing thread:
https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
That conversation, and the patches derived from it, lead to the
assumption that the wake-up path might be adjustable in a way that
addresses this case in particular.
At the same time, the vast difference in vruntimes is concerning
since, at least for some time frame, both processes are on the
runqueue.

We would be glad to hear some feedback on which paths to pursue and 
which might just be a dead end in the first place.


#################### TRACE EXCERPT ####################
The sched_place trace event was added to the end of the place_entity 
function and outputs:
sev -> sched_entity vruntime
sed -> sched_entity deadline
sel -> sched_entity vlag
avg -> cfs_rq avg_vruntime
min -> cfs_rq min_vruntime
cpu -> cpu of cfs_rq
nr  -> cfs_rq nr_running
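
As a rough illustration only, the instrumentation boils down to
something like the sketch below (the actual data was collected with a
proper trace event rather than a trace_printk(); helper names such as
entity_is_task(), task_of() and avg_vruntime() are the ones used in
kernel/sched/fair.c):

	static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
	{
		/* ... existing placement logic ... */

		struct task_struct *p = entity_is_task(se) ? task_of(se) : NULL;

		/* same fields as the legend above */
		trace_printk("comm=%s pid=%d sev=%llu sed=%llu sel=%lld avg=%llu min=%llu cpu=%d nr=%u\n",
			     p ? p->comm : "", p ? p->pid : 0,
			     se->vruntime, se->deadline, se->vlag,
			     avg_vruntime(cfs_rq), cfs_rq->min_vruntime,
			     cpu_of(rq_of(cfs_rq)), cfs_rq->nr_running);
	}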
---
     CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
--> migrates task from cpu 15 to 14
     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284 cpu=14 nr=0
--> places vhost 2920 on CPU 14 with vruntime 4242563284
     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14 nr=0
     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14 nr=0
     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14 nr=0
     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14 nr=0
     CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait: comm=vhost-2920 pid=2941 delay=9958 [ns]
     CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
    vhost-2920-2941    [014] D....   576.161439: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
    vhost-2920-2941    [014] d....   576.161446: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
    vhost-2920-2941    [014] d....   576.161447: sched_place: comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0 avg=86642125805 min=86642125805 cpu=14 nr=1
--> places kworker 6525 on cpu 14 with vruntime 86642125805
--> which is far larger than the vhost vruntime of 4242563284
    vhost-2920-2941    [014] d....   576.161447: sched_stat_blocked: comm=kworker/14:0 pid=6525 delay=10143757 [ns]
    vhost-2920-2941    [014] dN...   576.161447: sched_wakeup: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
    vhost-2920-2941    [014] dN...   576.161448: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
--> vhost 2920 finishes after 13884 ns of runtime
    vhost-2920-2941    [014] dN...   576.161448: sched_stat_wait: comm=kworker/14:0 pid=6525 delay=0 [ns]
    vhost-2920-2941    [014] d....   576.161448: sched_switch: prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==> next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> switch to kworker
  kworker/14:0-6525    [014] d....   576.161449: sched_waking: comm=CPU 2/KVM pid=2949 prio=120 target_cpu=007
  kworker/14:0-6525    [014] d....   576.161450: sched_stat_runtime: comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
--> kworker finishes after 3714 ns of runtime
  kworker/14:0-6525    [014] d....   576.161450: sched_stat_wait: comm=vhost-2920 pid=2941 delay=3714 [ns]
  kworker/14:0-6525    [014] d....   576.161451: sched_switch: prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
--> switch back to vhost
    vhost-2920-2941    [014] d....   576.161478: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
    vhost-2920-2941    [014] d....   576.161478: sched_place: comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150 avg=86642188144 min=86642188144 cpu=14 nr=1
--> kworker placed again on cpu 14 with vruntime 86642191859; the problem occurs only if lag <= 0, though having lag=0 does not always hit the problem
    vhost-2920-2941    [014] d....   576.161478: sched_stat_blocked: comm=kworker/14:0 pid=6525 delay=27943 [ns]
    vhost-2920-2941    [014] d....   576.161479: sched_wakeup: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
    vhost-2920-2941    [014] D....   576.161511: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
    vhost-2920-2941    [014] D....   576.161512: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
    vhost-2920-2941    [014] D....   576.161516: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
    vhost-2920-2941    [014] D....   576.161773: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
    vhost-2920-2941    [014] D....   576.161775: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
    vhost-2920-2941    [014] D....   576.162103: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
    vhost-2920-2941    [014] D....   576.162105: sched_waking: comm=vhost-2286 pid=2307 prio=120 target_cpu=021
    vhost-2920-2941    [014] D....   576.162326: sched_waking: comm=vhost-2286 pid=2305 prio=120 target_cpu=004
    vhost-2920-2941    [014] D....   576.162437: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
    vhost-2920-2941    [014] D....   576.162767: sched_waking: comm=vhost-2286 pid=2305 prio=120 target_cpu=004
    vhost-2920-2941    [014] d.h..   576.171155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
    vhost-2920-2941    [014] d.h..   576.181155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
    vhost-2920-2941    [014] d.h..   576.191154: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
    vhost-2920-2941    [014] d.h..   576.201155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
--> vhost gets rescheduled multiple times because its vruntime is significantly smaller than the vruntime of the kworker
    vhost-2920-2941    [014] dNh..   576.201176: sched_wakeup: comm=migration/14 pid=85 prio=0 target_cpu=014
    vhost-2920-2941    [014] dN...   576.201191: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
    vhost-2920-2941    [014] d....   576.201192: sched_switch: prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==> next_comm=migration/14 next_pid=85 next_prio=0
  migration/14-85      [014] d..1.   576.201194: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
--> vhost gets migrated off of cpu 14
  migration/14-85      [014] d..1.   576.201194: sched_place: comm=vhost-2920 pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923 cpu=15 nr=0
  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0 sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15 nr=0
  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0 sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15 nr=0
  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0 sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15 nr=0
  migration/14-85      [014] d..1.   576.201196: sched_place: comm= pid=0 sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523 min=70190876523 cpu=15 nr=0
  migration/14-85      [014] d....   576.201198: sched_stat_wait: comm=kworker/14:0 pid=6525 delay=39718472 [ns]
  migration/14-85      [014] d....   576.201198: sched_switch: prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==> next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> only now is the kworker eligible to run again, after a delay of 39718472 ns
  kworker/14:0-6525    [014] d....   576.201200: sched_waking: comm=CPU 0/KVM pid=2947 prio=120 target_cpu=012
  kworker/14:0-6525    [014] d....   576.201290: sched_stat_runtime: comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]

#################### WAIT DELAYS - PERF LATENCY ####################
last good commit --> perf sched latency -s max
 -------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Max delay start           | Max delay end          |
 -------------------------------------------------------------------------------------------------------------------------------------------
  CPU 2/KVM:(2)         |   5399.650 ms |   108698 | avg:   0.003 ms | max:   3.077 ms | max start:   544.090322 s | max end:   544.093399 s
  CPU 7/KVM:(2)         |   5111.132 ms |    69632 | avg:   0.003 ms | max:   2.980 ms | max start:   544.690994 s | max end:   544.693974 s
  kworker/22:3-ev:723   |    342.944 ms |    63417 | avg:   0.005 ms | max:   1.880 ms | max start:   545.235430 s | max end:   545.237310 s
  CPU 0/KVM:(2)         |   8171.431 ms |   433099 | avg:   0.003 ms | max:   1.004 ms | max start:   547.970344 s | max end:   547.971348 s
  CPU 1/KVM:(2)         |   5486.260 ms |   258702 | avg:   0.003 ms | max:   1.002 ms | max start:   548.782514 s | max end:   548.783516 s
  CPU 5/KVM:(2)         |   4766.143 ms |    65727 | avg:   0.003 ms | max:   0.997 ms | max start:   545.313610 s | max end:   545.314607 s
  vhost-2268:(6)        |  13206.503 ms |   315030 | avg:   0.003 ms | max:   0.989 ms | max start:   550.887761 s | max end:   550.888749 s
  vhost-2892:(6)        |  14467.268 ms |   214005 | avg:   0.003 ms | max:   0.981 ms | max start:   545.213819 s | max end:   545.214800 s
  CPU 3/KVM:(2)         |   5538.908 ms |    85105 | avg:   0.003 ms | max:   0.883 ms | max start:   547.138139 s | max end:   547.139023 s
  CPU 6/KVM:(2)         |   5289.827 ms |    72301 | avg:   0.003 ms | max:   0.836 ms | max start:   551.094590 s | max end:   551.095425 s

6.6 rc7 --> perf sched latency -s max
 -------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Max delay start           | Max delay end          |
 -------------------------------------------------------------------------------------------------------------------------------------------
  kworker/19:2-ev:1071  |     69.482 ms |    12700 | avg:   0.050 ms | max: 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
  kworker/13:1-ev:184   |     78.048 ms |    14645 | avg:   0.067 ms | max: 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
  kworker/12:1-ev:46148 |    138.488 ms |    26660 | avg:   0.021 ms | max: 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
  kworker/16:2-ev:33076 |    149.175 ms |    29491 | avg:   0.026 ms | max: 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
  CPU 3/KVM:(2)         |   1934.714 ms |    41896 | avg:   0.007 ms | max:  92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
  kworker/7:2-eve:17001 |     68.164 ms |    11820 | avg:   0.045 ms | max:  69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
  kworker/17:1-ev:46510 |     68.804 ms |    13328 | avg:   0.037 ms | max:  67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
  kworker/21:1-ev:45782 |     68.906 ms |    13215 | avg:   0.021 ms | max:  59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
  ksoftirqd/17:101      |      0.041 ms |        2 | avg:  25.028 ms | max:  50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s

#################### TEST SUMMARY ####################
  Setup description:
- single KVM host with 2 identical guests
- guests are connected virtually via Open vSwitch
- guests run uperf streaming read workload with 50 parallel connections
- one guest acts as the uperf client, the other one as the uperf server

Regression:
kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based placement)
kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based placement)
kernel-6.7.0-rc1: 41 Gb/s

KVM host:
- 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
- 64 GiB memory
- FEDORA 38
- kernel commandline: transparent_hugepage=never audit_enable=0 audit=0 
audit_debug=0 selinux=0

KVM guests:
- 8 vCPUs
- 8 GiB memory
- RHEL 9.2
- kernel: 5.14.0-162.6.1.el9_1.s390x
- kernel commandline: transparent_hugepage=never audit_enable=0 audit=0 
audit_debug=0 selinux=0

Open vSwitch:
- Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
- Open vSwitch ports attached to guests via virtio-net
- each guest has 4 vhost-queues

Domain xml snippet for Open vSwitch port:
<interface type="bridge" dev="OVS">
   <source bridge="vswitch0"/>
   <mac address="02:bb:97:28:02:02"/>
   <virtualport type="openvswitch"/>
   <model type="virtio"/>
   <target dev="vport1"/>
   <driver name="vhost" queues="4"/>
   <address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
</interface>

Benchmark: uperf
- workload: str-readx30k, 50 active parallel connections
- uperf server permanently sends data in 30720-byte chunks
- uperf client receives and acknowledges this data
- Server: uperf -s
- Client: uperf -a -i 30 -m uperf.xml

uperf.xml:
<?xml version="1.0"?>
<profile name="strburst">
   <group nprocs="50">
     <transaction iterations="1">
       <flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp"/>
     </transaction>
     <transaction duration="300">
       <flowop type="read" options="count=640 size=30k"/>
     </transaction>
     <transaction iterations="1">
       <flowop type="disconnect" />
     </transaction>
   </group>
</profile>


* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-16 18:58 EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement) Tobias Huschle
@ 2023-11-17  9:23 ` Peter Zijlstra
  2023-11-17  9:58   ` Peter Zijlstra
                     ` (2 more replies)
  2023-11-18  7:33 ` Abel Wu
  2023-11-19 13:29 ` Bagas Sanjaya
  2 siblings, 3 replies; 58+ messages in thread
From: Peter Zijlstra @ 2023-11-17  9:23 UTC (permalink / raw)
  To: Tobias Huschle; +Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang


Your email is pretty badly mangled by wrapping, please try and
reconfigure your MUA, esp. the trace and debug output is unreadable.

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:

> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.

Well, that's not good :/

> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)

So bear with me, I know absolutely nothing about virt stuff. I suspect
there's cgroups involved because shiny or something.

kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?

Also, what are their weights / nice values?

> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
> 
> So far this behavior is not different to what we see on pre-EEVDF kernels.
> 
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.

So why does this vhost stay on the CPU if it doesn't have anything to
do? (I've not tried to make sense of the trace, that's just too
painful).

> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.

That's weird. Can you add a trace_printk() to update_entity_lag() and
have it print out the lag, limit and vlag (post clamping) values? And
also in place_entity() for the reverse process, lag pre and post scaling
or something.
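
For reference, a rough sketch of what that amounts to against the v6.6
code (illustrative only, not a tested patch; update_entity_lag() in
kernel/sched/fair.c already computes exactly these values):

	static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		s64 lag, limit;

		SCHED_WARN_ON(!se->on_rq);
		lag = avg_vruntime(cfs_rq) - se->vruntime;

		limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
		se->vlag = clamp(lag, -limit, limit);

		/* debug: lag before clamping, the clamp limit, vlag after clamping */
		trace_printk("comm=%s pid=%d lag=%lld limit=%lld vlag=%lld\n",
			     entity_is_task(se) ? task_of(se)->comm : "",
			     entity_is_task(se) ? task_of(se)->pid : 0,
			     lag, limit, se->vlag);
	}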

After confirming both tasks are indeed in the same cgroup ofcourse,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.

Also, what HZ and what preemption mode are you running? If kworker is
somehow vastly over-shooting its slice -- keeps running way past the
avg_vruntime, then it will build up a giant lag and you get what you
describe, next time it wakes up it gets placed far to the right (exactly
where it was when it 'finally' went to sleep, relatively speaking).

> We found some options which sound plausible but we are not sure if they are
> valid or not:
> 
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
>     cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.

Suppose you have a few tasks (of equal weight) on you virtual timeline
like so:

   ---------+---+---+---+---+------
            ^       ^
	    |       `avg_vruntime
	    `-min_vruntime

Then the above would be more or less the relative placements of these
values. avg_vruntime is the weighted average of the various vruntimes
and is therefore always in the 'middle' of the tasks, and not somewhere
out-there.

min_vruntime is a monotonically increasing 'minimum' that's left-ish on
the tree (there are a few cases where a new task can be placed left of
min_vruntime and it's no longer actually the minimum, but whatever).

These values should be relatively close to one another, depending
ofcourse on the spread of the tasks. So I don't think this is causing
trouble.
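
To make "weighted average" concrete, a simplified stand-alone model of
the computation (a sketch only; the kernel's avg_vruntime(), added by
af4cf40470c2, keeps the sum relative to min_vruntime and also accounts
for the currently running entity):

	#include <stdint.h>

	/* V = sum(w_i * v_i) / sum(w_i) over the queued entities */
	static uint64_t avg_vruntime_model(const uint64_t *vruntime,
					   const uint64_t *weight, int nr)
	{
		unsigned __int128 key = 0;	/* weighted sum of vruntimes */
		uint64_t load = 0;		/* sum of weights */

		for (int i = 0; i < nr; i++) {
			key  += (unsigned __int128)weight[i] * vruntime[i];
			load += weight[i];
		}

		return load ? (uint64_t)(key / load) : 0;
	}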

Anyway, the big difference with lag based placement is that where
previously tasks (that do not migrate) retain their old vruntime and on
placing they get pulled forward to at least min_vruntime, so a task that
wildly overshoots, but then doesn't run for significant time can still
be overtaken and then when placed again be 'okay'.

Now OTOH, with lag-based placement, we strictly preserve their relative
offset vs avg_vruntime. So if they were *far* to the right when they go
to sleep, they will again be there on placement.

Sleeping doesn't help them anymore.
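
Condensed to the placement step itself, the difference is roughly the
following (heavily simplified from place_entity(); sleeper credit, lag
scaling by load and the deadline update are left out):

	/* pre-86bfbb7ce4f6: a sleeper is pulled up to (roughly) min_vruntime */
	se->vruntime = max_vruntime(se->vruntime, cfs_rq->min_vruntime);

	/* lag based placement: keep the offset relative to avg_vruntime */
	se->vruntime = avg_vruntime(cfs_rq) - se->vlag;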

Now, IF this is the problem, I might have a patch that helps:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=119feac4fcc77001cd9bf199b25f08d232289a5c

That branch is based on v6.7-rc1 and then some, but I think it's
relatively easy to rebase the lot on v6.6 (which I'm assuming you're
on).

I'm a little conflicted on the patch, conceptually I like what it does,
but the code it turned into is quite horrible. I've tried implementing
it differently a number of times but always ended up with things that
either didn't work or were worse.

But if it works, it works I suppose.



* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-17  9:23 ` Peter Zijlstra
@ 2023-11-17  9:58   ` Peter Zijlstra
  2023-11-17 12:24   ` Tobias Huschle
  2023-11-18  5:14   ` Abel Wu
  2 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2023-11-17  9:58 UTC (permalink / raw)
  To: Tobias Huschle; +Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:
> Now, IF this is the problem, I might have a patch that helps:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=119feac4fcc77001cd9bf199b25f08d232289a5c

And then I turn around and wipe the repository invalidating that link.

The sched/eevdf branch should be re-instated (with different SHA1), but
I'll include the patch below for reference.

---
Subject: sched/eevdf: Delay dequeue
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Sep 15 00:48:45 CEST 2023

For tasks that have negative lag (have received 'excess' service), delay the
dequeue and keep them in the runnable tree until they're eligible again. Or
rather, keep them until they're selected again, since finding their eligibility
crossover point is expensive.

The effect is a bit like sleeper bonus, the tasks keep contending for service
until either they get a wakeup or until they're selected again and are really
dequeued.

This means that any actual dequeue happens with positive lag (service owed)
and the task is more readily run when woken next.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |   88 +++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/fair.c     |   11 ++++++
 kernel/sched/features.h |   11 ++++++
 kernel/sched/sched.h    |    3 +
 5 files changed, 97 insertions(+), 17 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -916,6 +916,7 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			sched_delayed:1;
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3856,12 +3856,23 @@ static int ttwu_runnable(struct task_str
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		if (unlikely(p->sched_delayed)) {
+			p->sched_delayed = 0;
+			/* mustn't run a delayed task */
+			WARN_ON_ONCE(task_on_cpu(rq, p));
+			if (sched_feat(GENTLE_DELAY)) {
+				dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+				if (p->se.vlag > 0)
+					p->se.vlag = 0;
+				enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+			}
+		}
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
 			 * it should preempt the task that is current now.
 			 */
-			update_rq_clock(rq);
 			wakeup_preempt(rq, p, wake_flags);
 		}
 		ttwu_do_wakeup(p);
@@ -6565,6 +6576,24 @@ pick_next_task(struct rq *rq, struct tas
 # define SM_MASK_PREEMPT	SM_PREEMPT
 #endif
 
+static void deschedule_task(struct rq *rq, struct task_struct *p, unsigned long prev_state)
+{
+	p->sched_contributes_to_load =
+		(prev_state & TASK_UNINTERRUPTIBLE) &&
+		!(prev_state & TASK_NOLOAD) &&
+		!(prev_state & TASK_FROZEN);
+
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible++;
+
+	deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+	if (p->in_iowait) {
+		atomic_inc(&rq->nr_iowait);
+		delayacct_blkio_start();
+	}
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6650,6 +6679,8 @@ static void __sched notrace __schedule(u
 
 	switch_count = &prev->nivcsw;
 
+	WARN_ON_ONCE(prev->sched_delayed);
+
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
 	 * that we form a control dependency vs deactivate_task() below.
@@ -6659,14 +6690,6 @@ static void __sched notrace __schedule(u
 		if (signal_pending_state(prev_state, prev)) {
 			WRITE_ONCE(prev->__state, TASK_RUNNING);
 		} else {
-			prev->sched_contributes_to_load =
-				(prev_state & TASK_UNINTERRUPTIBLE) &&
-				!(prev_state & TASK_NOLOAD) &&
-				!(prev_state & TASK_FROZEN);
-
-			if (prev->sched_contributes_to_load)
-				rq->nr_uninterruptible++;
-
 			/*
 			 * __schedule()			ttwu()
 			 *   prev_state = prev->state;    if (p->on_rq && ...)
@@ -6678,17 +6701,50 @@ static void __sched notrace __schedule(u
 			 *
 			 * After this, schedule() must not care about p->state any more.
 			 */
-			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
-			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
-				delayacct_blkio_start();
-			}
+			if (sched_feat(DELAY_DEQUEUE) &&
+			    prev->sched_class->delay_dequeue_task &&
+			    prev->sched_class->delay_dequeue_task(rq, prev))
+				prev->sched_delayed = 1;
+			else
+				deschedule_task(rq, prev, prev_state);
 		}
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+	for (struct task_struct *tmp = prev;;) {
+		unsigned long tmp_state;
+
+		next = pick_next_task(rq, tmp, &rf);
+		if (unlikely(tmp != prev))
+			finish_task(tmp);
+
+		if (likely(!next->sched_delayed))
+			break;
+
+		next->sched_delayed = 0;
+
+		/*
+		 * A sched_delayed task must not be runnable at this point, see
+		 * ttwu_runnable().
+		 */
+		tmp_state = READ_ONCE(next->__state);
+		if (WARN_ON_ONCE(!tmp_state))
+			break;
+
+		prepare_task(next);
+		/*
+		 * Order ->on_cpu and ->on_rq, see the comments in
+		 * try_to_wake_up(). Normally this is smp_mb__after_spinlock()
+		 * above.
+		 */
+		smp_wmb();
+		deschedule_task(rq, next, tmp_state);
+		if (sched_feat(GENTLE_DELAY) && next->se.vlag > 0)
+			next->se.vlag = 0;
+
+		tmp = next;
+	}
+
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8540,6 +8540,16 @@ static struct task_struct *__pick_next_t
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool delay_dequeue_task_fair(struct rq *rq, struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	update_curr(cfs_rq);
+
+	return !entity_eligible(cfs_rq, se);
+}
+
 /*
  * Account for a descheduled task:
  */
@@ -13151,6 +13161,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 	.wakeup_preempt		= check_preempt_wakeup_fair,
 
+	.delay_dequeue_task	= delay_dequeue_task_fair,
 	.pick_next_task		= __pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,17 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  */
 SCHED_FEAT(PLACE_SLEEPER, false)
 SCHED_FEAT(GENTLE_SLEEPER, true)
+/*
+ * Delay dequeueing tasks until they get selected or woken.
+ *
+ * By delaying the dequeue for non-eligible tasks, they remain in the
+ * competition and can burn off their negative lag. When they get selected
+ * they'll have positive lag by definition.
+ *
+ * GENTLE_DELAY clips the lag on dequeue (or wakeup) to 0.
+ */
+SCHED_FEAT(DELAY_DEQUEUE, true)
+SCHED_FEAT(GENTLE_DELAY, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2254,6 +2254,7 @@ struct sched_class {
 
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
 
+	bool (*delay_dequeue_task)(struct rq *rq, struct task_struct *p);
 	struct task_struct *(*pick_next_task)(struct rq *rq);
 
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
@@ -2307,7 +2308,7 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+//	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 


* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-17  9:23 ` Peter Zijlstra
  2023-11-17  9:58   ` Peter Zijlstra
@ 2023-11-17 12:24   ` Tobias Huschle
  2023-11-17 12:37     ` Peter Zijlstra
  2023-11-18  5:14   ` Abel Wu
  2 siblings, 1 reply; 58+ messages in thread
From: Tobias Huschle @ 2023-11-17 12:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:
> 
> Your email is pretty badly mangled by wrapping, please try and
> reconfigure your MUA, esp. the trace and debug output is unreadable.

Just saw that .. sorry, will append the trace and latency data again.

[...]

> 
> So bear with me, I know absolutely nothing about virt stuff. I suspect
> there's cgroups involved because shiny or something.
> 
> kworkers are typically not in cgroups and are part of the root cgroup,
> but what's a vhost and where does it live?

The qemu instances of the two KVM guests are placed into cgroups.
The vhosts run within the context of these qemu instances (4 threads per guest).
So they are also put into those cgroups.

I'll answer the other questions you brought up as well, but I guess that one 
is most critical: 

> 
> After confirming both tasks are indeed in the same cgroup ofcourse,
> because if they're not, vruntime will be meaningless to compare and we
> should look elsewhere.

In that case we probably have to go with elsewhere ... which is good to know.

> 
> Also, what are their weights / nice values?
> 

Everything runs under default priority of 120. No nice values are set.

[...]

> 
> So why does this vhost stay on the CPU if it doesn't have anything to
> do? (I've not tried to make sense of the trace, that's just too
> painful).

It does something; we just don't see anything scheduler-related in the trace anymore.
So far we haven't gone down the path of looking deeper into vhost.
We actually don't know what it's doing at the point where we took the trace.

[...]

> 
> That's, weird. Can you add a trace_printk() to update_entity_lag() and
> have it print out the lag, limit and vlag (post clamping) values? And
> also in place_entity() for the reverse process, lag pre and post scaling
> or something.

There is already a trace statement in the log for place_entity; that's where we
found some of the irritating numbers. We will add another one for update_entity_lag 
and send an update once we have it.

Unless there is no sense in doing so because of the involvement of cgroups.

> 
> Also, what HZ and what preemption mode are you running? If kworker is

HZ=100
We run a non-preemptible kernel.

> somehow vastly over-shooting it's slice -- keeps running way past the
> avg_vruntime, then it will build up a giant lag and you get what you
> describe, next time it wakes up it gets placed far to the right (exactly
> where it was when it 'finally' went to sleep, relatively speaking).

That's one of the things that irritated us: the kworker has basically no lag.
I hope the reformatted trace below helps clarify things.
If this is OK due to them not being in the same cgroup, then that's just how it is.

[...]

> 
> Now OTOH, with lag-based placement,  we strictly preserve their relative
> offset vs avg_vruntime. So if they were *far* to the right when they go
> to sleep, they will again be there on placement.

Yea, that's what I gathered from the EEVDF paper and the 3 strategies it
discusses for handling tasks that "rejoin the competition".

I'll do some catching up on how cgroups play into this.

> 
> Sleeping doesn't help them anymore.
> 
> Now, IF this is the problem, I might have a patch that helps:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=119feac4fcc77001cd9bf199b25f08d232289a5c
> 
> That branch is based on v6.7-rc1 and then some, but I think it's
> relatively easy to rebase the lot on v6.6 (which I'm assuming you're
> on).
> 
> I'm a little conflicted on the patch, conceptually I like what it does,
> but the code it turned into is quite horrible. I've tried implementing
> it differently a number of times but always ended up with things that
> either didn't work or were worse.
> 
> But if it works, it works I suppose.
> 

I'll check out the patch, thanks for the pointer.

Here's the hopefully unmangled data:

#################### TRACE EXCERPT ####################
The sched_place trace event was added to the end of the place_entity function and outputs:
sev -> sched_entity vruntime
sed -> sched_entity deadline
sel -> sched_entity vlag
avg -> cfs_rq avg_vruntime
min -> cfs_rq min_vruntime
cpu -> cpu of cfs_rq
nr  -> cfs_rq nr_running
---
    CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
--> migrates task from cpu 15 to 14
    CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284 cpu=14 nr=0
--> places vhost 2920 on CPU 14 with vruntime 4242563284
    CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14 nr=0
    CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14 nr=0
    CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14 nr=0
    CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14 nr=0
    CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait: comm=vhost-2920 pid=2941 delay=9958 [ns]
    CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
   vhost-2920-2941    [014] D....   576.161439: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
   vhost-2920-2941    [014] d....   576.161446: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
   vhost-2920-2941    [014] d....   576.161447: sched_place: comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0 avg=86642125805 min=86642125805 cpu=14 nr=1
--> places kworker 6525 on cpu 14 with vruntime 86642125805
-->  which is far larger than vhost vruntime of  4242563284
   vhost-2920-2941    [014] d....   576.161447: sched_stat_blocked: comm=kworker/14:0 pid=6525 delay=10143757 [ns]
   vhost-2920-2941    [014] dN...   576.161447: sched_wakeup: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
   vhost-2920-2941    [014] dN...   576.161448: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
--> vhost 2920 finishes after 13884 ns of runtime
   vhost-2920-2941    [014] dN...   576.161448: sched_stat_wait: comm=kworker/14:0 pid=6525 delay=0 [ns]
   vhost-2920-2941    [014] d....   576.161448: sched_switch: prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==> next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> switch to kworker
 kworker/14:0-6525    [014] d....   576.161449: sched_waking: comm=CPU 2/KVM pid=2949 prio=120 target_cpu=007
 kworker/14:0-6525    [014] d....   576.161450: sched_stat_runtime: comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
--> kworker finishes after 3714 ns of runtime
 kworker/14:0-6525    [014] d....   576.161450: sched_stat_wait: comm=vhost-2920 pid=2941 delay=3714 [ns]
 kworker/14:0-6525    [014] d....   576.161451: sched_switch: prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
--> switch back to vhost
   vhost-2920-2941    [014] d....   576.161478: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
   vhost-2920-2941    [014] d....   576.161478: sched_place: comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150 avg=86642188144 min=86642188144 cpu=14 nr=1
--> kworker placed again on cpu 14 with vruntime 86642191859, the problem occurs only if lag <= 0, having lag=0 does not always hit the problem though
   vhost-2920-2941    [014] d....   576.161478: sched_stat_blocked: comm=kworker/14:0 pid=6525 delay=27943 [ns]
   vhost-2920-2941    [014] d....   576.161479: sched_wakeup: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
   vhost-2920-2941    [014] D....   576.161511: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
   vhost-2920-2941    [014] D....   576.161512: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
   vhost-2920-2941    [014] D....   576.161516: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
   vhost-2920-2941    [014] D....   576.161773: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
   vhost-2920-2941    [014] D....   576.161775: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
   vhost-2920-2941    [014] D....   576.162103: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
   vhost-2920-2941    [014] D....   576.162105: sched_waking: comm=vhost-2286 pid=2307 prio=120 target_cpu=021
   vhost-2920-2941    [014] D....   576.162326: sched_waking: comm=vhost-2286 pid=2305 prio=120 target_cpu=004
   vhost-2920-2941    [014] D....   576.162437: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
   vhost-2920-2941    [014] D....   576.162767: sched_waking: comm=vhost-2286 pid=2305 prio=120 target_cpu=004
   vhost-2920-2941    [014] d.h..   576.171155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
   vhost-2920-2941    [014] d.h..   576.181155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
   vhost-2920-2941    [014] d.h..   576.191154: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
   vhost-2920-2941    [014] d.h..   576.201155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
--> vhost gets rescheduled multiple times because its vruntime is significantly smaller than the vruntime of the kworker
   vhost-2920-2941    [014] dNh..   576.201176: sched_wakeup: comm=migration/14 pid=85 prio=0 target_cpu=014
   vhost-2920-2941    [014] dN...   576.201191: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
   vhost-2920-2941    [014] d....   576.201192: sched_switch: prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==> next_comm=migration/14 next_pid=85 next_prio=0
 migration/14-85      [014] d..1.   576.201194: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
--> vhost gets migrated off of cpu 14
 migration/14-85      [014] d..1.   576.201194: sched_place: comm=vhost-2920 pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923 cpu=15 nr=0
 migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0 sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15 nr=0
 migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0 sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15 nr=0
 migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0 sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15 nr=0
 migration/14-85      [014] d..1.   576.201196: sched_place: comm= pid=0 sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523 min=70190876523 cpu=15 nr=0
 migration/14-85      [014] d....   576.201198: sched_stat_wait: comm=kworker/14:0 pid=6525 delay=39718472 [ns]
 migration/14-85      [014] d....   576.201198: sched_switch: prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==> next_comm=kworker/14:0 next_pid=6525 next_prio=120
 --> only now, kworker is eligible to run again, after a delay of 39718472 ns
 kworker/14:0-6525    [014] d....   576.201200: sched_waking: comm=CPU 0/KVM pid=2947 prio=120 target_cpu=012
 kworker/14:0-6525    [014] d....   576.201290: sched_stat_runtime: comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]

#################### WAIT DELAYS - PERF LATENCY ####################
last good commit --> perf sched latency -s max
 -------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Max delay start           | Max delay end          |
 -------------------------------------------------------------------------------------------------------------------------------------------
  CPU 2/KVM:(2)         |   5399.650 ms |   108698 | avg:   0.003 ms | max:   3.077 ms | max start:   544.090322 s | max end:   544.093399 s
  CPU 7/KVM:(2)         |   5111.132 ms |    69632 | avg:   0.003 ms | max:   2.980 ms | max start:   544.690994 s | max end:   544.693974 s
  kworker/22:3-ev:723   |    342.944 ms |    63417 | avg:   0.005 ms | max:   1.880 ms | max start:   545.235430 s | max end:   545.237310 s
  CPU 0/KVM:(2)         |   8171.431 ms |   433099 | avg:   0.003 ms | max:   1.004 ms | max start:   547.970344 s | max end:   547.971348 s
  CPU 1/KVM:(2)         |   5486.260 ms |   258702 | avg:   0.003 ms | max:   1.002 ms | max start:   548.782514 s | max end:   548.783516 s
  CPU 5/KVM:(2)         |   4766.143 ms |    65727 | avg:   0.003 ms | max:   0.997 ms | max start:   545.313610 s | max end:   545.314607 s
  vhost-2268:(6)        |  13206.503 ms |   315030 | avg:   0.003 ms | max:   0.989 ms | max start:   550.887761 s | max end:   550.888749 s
  vhost-2892:(6)        |  14467.268 ms |   214005 | avg:   0.003 ms | max:   0.981 ms | max start:   545.213819 s | max end:   545.214800 s
  CPU 3/KVM:(2)         |   5538.908 ms |    85105 | avg:   0.003 ms | max:   0.883 ms | max start:   547.138139 s | max end:   547.139023 s
  CPU 6/KVM:(2)         |   5289.827 ms |    72301 | avg:   0.003 ms | max:   0.836 ms | max start:   551.094590 s | max end:   551.095425 s

6.6 rc7 --> perf sched latency -s max
-------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Max delay start           | Max delay end          |
 -------------------------------------------------------------------------------------------------------------------------------------------
  kworker/19:2-ev:1071  |     69.482 ms |    12700 | avg:   0.050 ms | max: 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
  kworker/13:1-ev:184   |     78.048 ms |    14645 | avg:   0.067 ms | max: 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
  kworker/12:1-ev:46148 |    138.488 ms |    26660 | avg:   0.021 ms | max: 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
  kworker/16:2-ev:33076 |    149.175 ms |    29491 | avg:   0.026 ms | max: 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
  CPU 3/KVM:(2)         |   1934.714 ms |    41896 | avg:   0.007 ms | max:  92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
  kworker/7:2-eve:17001 |     68.164 ms |    11820 | avg:   0.045 ms | max:  69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
  kworker/17:1-ev:46510 |     68.804 ms |    13328 | avg:   0.037 ms | max:  67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
  kworker/21:1-ev:45782 |     68.906 ms |    13215 | avg:   0.021 ms | max:  59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
  ksoftirqd/17:101      |      0.041 ms |        2 | avg:  25.028 ms | max:  50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s


* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-17 12:24   ` Tobias Huschle
@ 2023-11-17 12:37     ` Peter Zijlstra
  2023-11-17 13:07       ` Abel Wu
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2023-11-17 12:37 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang, wuyun.abel

On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:
> On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:

> > kworkers are typically not in cgroups and are part of the root cgroup,
> > but what's a vhost and where does it live?
> 
> The qemu instances of the two KVM guests are placed into cgroups.
> The vhosts run within the context of these qemu instances (4 threads per guest).
> So they are also put into those cgroups.
> 
> I'll answer the other questions you brought up as well, but I guess that one 
> is most critical: 
> 
> > 
> > After confirming both tasks are indeed in the same cgroup ofcourse,
> > because if they're not, vruntime will be meaningless to compare and we
> > should look elsewhere.
> 
> In that case we probably have to go with elsewhere ... which is good to know.

Ah, so if this is a cgroup issue, it might be worth trying this patch
that we have in tip/sched/urgent.

I'll try and read the rest of the email a little later, gotta run
errands first.

---

commit eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu <wuyun.abel@bytedance.com>
Date:   Tue Nov 7 17:05:07 2023 +0800

    sched/eevdf: Fix vruntime adjustment on reweight
    
    vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
    it gets re-weighted, and the calculations can be simplified based
    on the fact that re-weight won't change the w-average of all the
    entities. Please check the proofs in comments.
    
    But adjusting vruntime can also cause position change in RB-tree
    hence require re-queue to fix up which might be costly. This might
    be avoided by deferring adjustment to the time the entity actually
    leaves tree (dequeue/pick), but that will negatively affect task
    selection and probably not good enough either.
    
    Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
    Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138ce54b..025d90925bf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			   unsigned long weight)
+{
+	unsigned long old_weight = se->load.weight;
+	u64 avruntime = avg_vruntime(cfs_rq);
+	s64 vlag, vslice;
+
+	/*
+	 * VRUNTIME
+	 * ========
+	 *
+	 * COROLLARY #1: The virtual runtime of the entity needs to be
+	 * adjusted if re-weight at !0-lag point.
+	 *
+	 * Proof: For contradiction assume this is not true, so we can
+	 * re-weight without changing vruntime at !0-lag point.
+	 *
+	 *             Weight	VRuntime   Avg-VRuntime
+	 *     before    w          v            V
+	 *      after    w'         v'           V'
+	 *
+	 * Since lag needs to be preserved through re-weight:
+	 *
+	 *	lag = (V - v)*w = (V'- v')*w', where v = v'
+	 *	==>	V' = (V - v)*w/w' + v		(1)
+	 *
+	 * Let W be the total weight of the entities before reweight,
+	 * since V' is the new weighted average of entities:
+	 *
+	 *	V' = (WV + w'v - wv) / (W + w' - w)	(2)
+	 *
+	 * by using (1) & (2) we obtain:
+	 *
+	 *	(WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+	 *	==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+	 *	==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+	 *	==>	(V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+	 *
+	 * Since we are doing at !0-lag point which means V != v, we
+	 * can simplify (3):
+	 *
+	 *	==>	W / (W + w' - w) = w / w'
+	 *	==>	Ww' = Ww + ww' - ww
+	 *	==>	W * (w' - w) = w * (w' - w)
+	 *	==>	W = w	(re-weight indicates w' != w)
+	 *
+	 * So the cfs_rq contains only one entity, hence vruntime of
+	 * the entity @v should always equal to the cfs_rq's weighted
+	 * average vruntime @V, which means we will always re-weight
+	 * at 0-lag point, thus breach assumption. Proof completed.
+	 *
+	 *
+	 * COROLLARY #2: Re-weight does NOT affect weighted average
+	 * vruntime of all the entities.
+	 *
+	 * Proof: According to corollary #1, Eq. (1) should be:
+	 *
+	 *	(V - v)*w = (V' - v')*w'
+	 *	==>    v' = V' - (V - v)*w/w'		(4)
+	 *
+	 * According to the weighted average formula, we have:
+	 *
+	 *	V' = (WV - wv + w'v') / (W - w + w')
+	 *	   = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+	 *	   = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+	 *	   = (WV + w'V' - Vw) / (W - w + w')
+	 *
+	 *	==>  V'*(W - w + w') = WV + w'V' - Vw
+	 *	==>	V' * (W - w) = (W - w) * V	(5)
+	 *
+	 * If the entity is the only one in the cfs_rq, then reweight
+	 * always occurs at 0-lag point, so V won't change. Or else
+	 * there are other entities, hence W != w, then Eq. (5) turns
+	 * into V' = V. So V won't change in either case, proof done.
+	 *
+	 *
+	 * So according to corollary #1 & #2, the effect of re-weight
+	 * on vruntime should be:
+	 *
+	 *	v' = V' - (V - v) * w / w'		(4)
+	 *	   = V  - (V - v) * w / w'
+	 *	   = V  - vl * w / w'
+	 *	   = V  - vl'
+	 */
+	if (avruntime != se->vruntime) {
+		vlag = (s64)(avruntime - se->vruntime);
+		vlag = div_s64(vlag * old_weight, weight);
+		se->vruntime = avruntime - vlag;
+	}
+
+	/*
+	 * DEADLINE
+	 * ========
+	 *
+	 * When the weight changes, the virtual time slope changes and
+	 * we should adjust the relative virtual deadline accordingly.
+	 *
+	 *	d' = v' + (d - v)*w/w'
+	 *	   = V' - (V - v)*w/w' + (d - v)*w/w'
+	 *	   = V  - (V - v)*w/w' + (d - v)*w/w'
+	 *	   = V  + (d - V)*w/w'
+	 */
+	vslice = (s64)(se->deadline - avruntime);
+	vslice = div_s64(vslice * old_weight, weight);
+	se->deadline = avruntime + vslice;
+}
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
-	unsigned long old_weight = se->load.weight;
+	bool curr = cfs_rq->curr == se;
 
 	if (se->on_rq) {
 		/* commit outstanding execution time */
-		if (cfs_rq->curr == se)
+		if (curr)
 			update_curr(cfs_rq);
 		else
-			avg_vruntime_sub(cfs_rq, se);
+			__dequeue_entity(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	update_load_set(&se->load, weight);
-
 	if (!se->on_rq) {
 		/*
 		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
 		 * we need to scale se->vlag when w_i changes.
 		 */
-		se->vlag = div_s64(se->vlag * old_weight, weight);
+		se->vlag = div_s64(se->vlag * se->load.weight, weight);
 	} else {
-		s64 deadline = se->deadline - se->vruntime;
-		/*
-		 * When the weight changes, the virtual time slope changes and
-		 * we should adjust the relative virtual deadline accordingly.
-		 */
-		deadline = div_s64(deadline * old_weight, weight);
-		se->deadline = se->vruntime + deadline;
-		if (se != cfs_rq->curr)
-			min_deadline_cb_propagate(&se->run_node, NULL);
+		reweight_eevdf(cfs_rq, se, weight);
 	}
 
+	update_load_set(&se->load, weight);
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
-		if (cfs_rq->curr != se)
-			avg_vruntime_add(cfs_rq, se);
+		if (!curr) {
+			/*
+			 * The entity's vruntime has been adjusted, so let's check
+			 * whether the rq-wide min_vruntime needs updated too. Since
+			 * the calculations above require stable min_vruntime rather
+			 * than up-to-date one, we do the update at the end of the
+			 * reweight process.
+			 */
+			__enqueue_entity(cfs_rq, se);
+			update_min_vruntime(cfs_rq);
+		}
 	}
 }
 
@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
 
 #ifndef CONFIG_SMP
 	shares = READ_ONCE(gcfs_rq->tg->shares);
-
-	if (likely(se->load.weight == shares))
-		return;
 #else
-	shares   = calc_group_shares(gcfs_rq);
+	shares = calc_group_shares(gcfs_rq);
 #endif
-
-	reweight_entity(cfs_rq_of(se), se, shares);
+	if (unlikely(se->load.weight != shares))
+		reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */


* Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-17 12:37     ` Peter Zijlstra
@ 2023-11-17 13:07       ` Abel Wu
  2023-11-21 13:17         ` Tobias Huschle
  0 siblings, 1 reply; 58+ messages in thread
From: Abel Wu @ 2023-11-17 13:07 UTC (permalink / raw)
  To: Peter Zijlstra, Tobias Huschle
  Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On 11/17/23 8:37 PM, Peter Zijlstra Wrote:
> On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:
>> On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:
> 
>>> kworkers are typically not in cgroups and are part of the root cgroup,
>>> but what's a vhost and where does it live?
>>
>> The qemu instances of the two KVM guests are placed into cgroups.
>> The vhosts run within the context of these qemu instances (4 threads per guest).
>> So they are also put into those cgroups.
>>
>> I'll answer the other questions you brought up as well, but I guess that one
>> is most critical:
>>
>>>
>>> After confirming both tasks are indeed in the same cgroup ofcourse,
>>> because if they're not, vruntime will be meaningless to compare and we
>>> should look elsewhere.
>>
>> In that case we probably have to go with elsewhere ... which is good to know.
> 
> Ah, so if this is a cgroup issue, it might be worth trying this patch
> that we have in tip/sched/urgent.

And please also apply this fix:
https://lore.kernel.org/all/20231117080106.12890-1-s921975628@gmail.com/

> 
> I'll try and read the rest of the email a little later, gotta run
> errands first.
> 
> ---
> 
> commit eab03c23c2a162085b13200d7942fc5a00b5ccc8
> Author: Abel Wu <wuyun.abel@bytedance.com>
> Date:   Tue Nov 7 17:05:07 2023 +0800
> 
>      sched/eevdf: Fix vruntime adjustment on reweight
>      
>      vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
>      it gets re-weighted, and the calculations can be simplified based
>      on the fact that re-weight won't change the w-average of all the
>      entities. Please check the proofs in comments.
>      
>      But adjusting vruntime can also cause position change in RB-tree
>      hence require re-queue to fix up which might be costly. This might
>      be avoided by deferring adjustment to the time the entity actually
>      leaves tree (dequeue/pick), but that will negatively affect task
>      selection and probably not good enough either.
>      
>      Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
>      Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>      Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2048138ce54b..025d90925bf6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3666,41 +3666,140 @@ static inline void
>   dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
>   #endif
>   
> +static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +			   unsigned long weight)
> +{
> +	unsigned long old_weight = se->load.weight;
> +	u64 avruntime = avg_vruntime(cfs_rq);
> +	s64 vlag, vslice;
> +
> +	/*
> +	 * VRUNTIME
> +	 * ========
> +	 *
> +	 * COROLLARY #1: The virtual runtime of the entity needs to be
> +	 * adjusted if re-weight at !0-lag point.
> +	 *
> +	 * Proof: For contradiction assume this is not true, so we can
> +	 * re-weight without changing vruntime at !0-lag point.
> +	 *
> +	 *             Weight	VRuntime   Avg-VRuntime
> +	 *     before    w          v            V
> +	 *      after    w'         v'           V'
> +	 *
> +	 * Since lag needs to be preserved through re-weight:
> +	 *
> +	 *	lag = (V - v)*w = (V'- v')*w', where v = v'
> +	 *	==>	V' = (V - v)*w/w' + v		(1)
> +	 *
> +	 * Let W be the total weight of the entities before reweight,
> +	 * since V' is the new weighted average of entities:
> +	 *
> +	 *	V' = (WV + w'v - wv) / (W + w' - w)	(2)
> +	 *
> +	 * by using (1) & (2) we obtain:
> +	 *
> +	 *	(WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
> +	 *	==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
> +	 *	==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
> +	 *	==>	(V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
> +	 *
> +	 * Since we are doing at !0-lag point which means V != v, we
> +	 * can simplify (3):
> +	 *
> +	 *	==>	W / (W + w' - w) = w / w'
> +	 *	==>	Ww' = Ww + ww' - ww
> +	 *	==>	W * (w' - w) = w * (w' - w)
> +	 *	==>	W = w	(re-weight indicates w' != w)
> +	 *
> +	 * So the cfs_rq contains only one entity, hence vruntime of
> +	 * the entity @v should always equal to the cfs_rq's weighted
> +	 * average vruntime @V, which means we will always re-weight
> +	 * at 0-lag point, thus breach assumption. Proof completed.
> +	 *
> +	 *
> +	 * COROLLARY #2: Re-weight does NOT affect weighted average
> +	 * vruntime of all the entities.
> +	 *
> +	 * Proof: According to corollary #1, Eq. (1) should be:
> +	 *
> +	 *	(V - v)*w = (V' - v')*w'
> +	 *	==>    v' = V' - (V - v)*w/w'		(4)
> +	 *
> +	 * According to the weighted average formula, we have:
> +	 *
> +	 *	V' = (WV - wv + w'v') / (W - w + w')
> +	 *	   = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
> +	 *	   = (WV - wv + w'V' - Vw + wv) / (W - w + w')
> +	 *	   = (WV + w'V' - Vw) / (W - w + w')
> +	 *
> +	 *	==>  V'*(W - w + w') = WV + w'V' - Vw
> +	 *	==>	V' * (W - w) = (W - w) * V	(5)
> +	 *
> +	 * If the entity is the only one in the cfs_rq, then reweight
> +	 * always occurs at 0-lag point, so V won't change. Or else
> +	 * there are other entities, hence W != w, then Eq. (5) turns
> +	 * into V' = V. So V won't change in either case, proof done.
> +	 *
> +	 *
> +	 * So according to corollary #1 & #2, the effect of re-weight
> +	 * on vruntime should be:
> +	 *
> +	 *	v' = V' - (V - v) * w / w'		(4)
> +	 *	   = V  - (V - v) * w / w'
> +	 *	   = V  - vl * w / w'
> +	 *	   = V  - vl'
> +	 */
> +	if (avruntime != se->vruntime) {
> +		vlag = (s64)(avruntime - se->vruntime);
> +		vlag = div_s64(vlag * old_weight, weight);
> +		se->vruntime = avruntime - vlag;
> +	}
> +
> +	/*
> +	 * DEADLINE
> +	 * ========
> +	 *
> +	 * When the weight changes, the virtual time slope changes and
> +	 * we should adjust the relative virtual deadline accordingly.
> +	 *
> +	 *	d' = v' + (d - v)*w/w'
> +	 *	   = V' - (V - v)*w/w' + (d - v)*w/w'
> +	 *	   = V  - (V - v)*w/w' + (d - v)*w/w'
> +	 *	   = V  + (d - V)*w/w'
> +	 */
> +	vslice = (s64)(se->deadline - avruntime);
> +	vslice = div_s64(vslice * old_weight, weight);
> +	se->deadline = avruntime + vslice;
> +}
> +
>   static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>   			    unsigned long weight)
>   {
> -	unsigned long old_weight = se->load.weight;
> +	bool curr = cfs_rq->curr == se;
>   
>   	if (se->on_rq) {
>   		/* commit outstanding execution time */
> -		if (cfs_rq->curr == se)
> +		if (curr)
>   			update_curr(cfs_rq);
>   		else
> -			avg_vruntime_sub(cfs_rq, se);
> +			__dequeue_entity(cfs_rq, se);
>   		update_load_sub(&cfs_rq->load, se->load.weight);
>   	}
>   	dequeue_load_avg(cfs_rq, se);
>   
> -	update_load_set(&se->load, weight);
> -
>   	if (!se->on_rq) {
>   		/*
>   		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
>   		 * we need to scale se->vlag when w_i changes.
>   		 */
> -		se->vlag = div_s64(se->vlag * old_weight, weight);
> +		se->vlag = div_s64(se->vlag * se->load.weight, weight);
>   	} else {
> -		s64 deadline = se->deadline - se->vruntime;
> -		/*
> -		 * When the weight changes, the virtual time slope changes and
> -		 * we should adjust the relative virtual deadline accordingly.
> -		 */
> -		deadline = div_s64(deadline * old_weight, weight);
> -		se->deadline = se->vruntime + deadline;
> -		if (se != cfs_rq->curr)
> -			min_deadline_cb_propagate(&se->run_node, NULL);
> +		reweight_eevdf(cfs_rq, se, weight);
>   	}
>   
> +	update_load_set(&se->load, weight);
> +
>   #ifdef CONFIG_SMP
>   	do {
>   		u32 divider = get_pelt_divider(&se->avg);
> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>   	enqueue_load_avg(cfs_rq, se);
>   	if (se->on_rq) {
>   		update_load_add(&cfs_rq->load, se->load.weight);
> -		if (cfs_rq->curr != se)
> -			avg_vruntime_add(cfs_rq, se);
> +		if (!curr) {
> +			/*
> +			 * The entity's vruntime has been adjusted, so let's check
> +			 * whether the rq-wide min_vruntime needs updated too. Since
> +			 * the calculations above require stable min_vruntime rather
> +			 * than up-to-date one, we do the update at the end of the
> +			 * reweight process.
> +			 */
> +			__enqueue_entity(cfs_rq, se);
> +			update_min_vruntime(cfs_rq);
> +		}
>   	}
>   }
>   
> @@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
>   
>   #ifndef CONFIG_SMP
>   	shares = READ_ONCE(gcfs_rq->tg->shares);
> -
> -	if (likely(se->load.weight == shares))
> -		return;
>   #else
> -	shares   = calc_group_shares(gcfs_rq);
> +	shares = calc_group_shares(gcfs_rq);
>   #endif
> -
> -	reweight_entity(cfs_rq_of(se), se, shares);
> +	if (unlikely(se->load.weight != shares))
> +		reweight_entity(cfs_rq_of(se), se, shares);
>   }
>   
>   #else /* CONFIG_FAIR_GROUP_SCHED */

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-17  9:23 ` Peter Zijlstra
  2023-11-17  9:58   ` Peter Zijlstra
  2023-11-17 12:24   ` Tobias Huschle
@ 2023-11-18  5:14   ` Abel Wu
  2023-11-20 10:56     ` Peter Zijlstra
  2 siblings, 1 reply; 58+ messages in thread
From: Abel Wu @ 2023-11-18  5:14 UTC (permalink / raw)
  To: Peter Zijlstra, Tobias Huschle
  Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On 11/17/23 5:23 PM, Peter Zijlstra Wrote:
> 
> Your email is pretty badly mangled by wrapping, please try and
> reconfigure your MUA, esp. the trace and debug output is unreadable.
> 
> On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
> 
>> The base scenario are two KVM guests running on an s390 LPAR. One guest
>> hosts the uperf server, one the uperf client.
>> With EEVDF we observe a regression of ~50% for a strburst test.
>> For a more detailed description of the setup see the section TEST SUMMARY at
>> the bottom.
> 
> Well, that's not good :/
> 
>> Short summary:
>> The mentioned kworker has been scheduled to CPU 14 before the tracing was
>> enabled.
>> A vhost process is migrated onto CPU 14.
>> The vruntimes of kworker and vhost differ significantly (86642125805 vs
>> 4242563284 -> factor 20)
> 
> So bear with me, I know absolutely nothing about virt stuff. I suspect
> there's cgroups involved because shiny or something.
> 
> kworkers are typically not in cgroups and are part of the root cgroup,
> but what's a vhost and where does it live?
> 
> Also, what are their weights / nice values?
> 
>> The vhost process wants to wake up the kworker, therefore the kworker is
>> placed onto the runqueue again and set to runnable.
>> The vhost process continues to execute, waking up other vhost processes on
>> other CPUs.
>>
>> So far this behavior is not different to what we see on pre-EEVDF kernels.
>>
>> On timestamp 576.162767, the vhost process triggers the last wake up of
>> another vhost on another CPU.
>> Until timestamp 576.171155, we see no other activity. Now, the vhost process
>> ends its time slice.
>> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
>> off to CPU 15.
> 
> So why does this vhost stay on the CPU if it doesn't have anything to
> do? (I've not tried to make sense of the trace, that's just too
> painful).
> 
>> This does not occur with older kernels.
>> The kworker has to wait for the migration to happen in order to be able to
>> execute again.
>> This is due to the fact, that the vruntime of the kworker is significantly
>> larger than the one of vhost.
> 
> That's weird. Can you add a trace_printk() to update_entity_lag() and
> have it print out the lag, limit and vlag (post clamping) values? And
> also in place_entity() for the reverse process, lag pre and post scaling
> or something.
> 
> After confirming both tasks are indeed in the same cgroup ofcourse,
> because if they're not, vruntime will be meaningless to compare and we
> should look elsewhere.
> 
> Also, what HZ and what preemption mode are you running? If kworker is
> somehow vastly over-shooting its slice -- keeps running way past the
> avg_vruntime, then it will build up a giant lag and you get what you
> describe, next time it wakes up it gets placed far to the right (exactly
> where it was when it 'finally' went to sleep, relatively speaking).
> 
>> We found some options which sound plausible but we are not sure if they are
>> valid or not:
>>
>> 1. The wake up path has a dependency on the vruntime metrics that now delays
>> the execution of the kworker.
>> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
>> which updates the way cfs_rq->min_vruntime and
>>      cfs_rq->avg_runtime are set might have introduced an issue which is
>> uncovered with the commit mentioned above.
> 
> Suppose you have a few tasks (of equal weight) on your virtual timeline
> like so:
> 
>     ---------+---+---+---+---+------
>              ^       ^
> 	    |       `avg_vruntime
> 	    `-min_vruntime
> 
> Then the above would be more or less the relative placements of these
> values. avg_vruntime is the weighted average of the various vruntimes
> and is therefore always in the 'middle' of the tasks, and not somewhere
> out-there.
> 
> min_vruntime is a monotonically increasing 'minimum' that's left-ish on
> the tree (there's a few cases where a new task can be placed left of
> min_vruntime and it's no longer actually the minimum, but whatever).
> 
> These values should be relatively close to one another, depending
> ofcourse on the spread of the tasks. So I don't think this is causing
> trouble.
> 
> Anyway, the big difference with lag based placement is that where
> previously tasks (that do not migrate) retain their old vruntime and on
> placing they get pulled forward to at least min_vruntime, so a task that
> wildly overshoots, but then doesn't run for significant time can still
> be overtaken and then when placed again be 'okay'.
> 
> Now OTOH, with lag-based placement, we strictly preserve their relative
> offset vs avg_vruntime. So if they were *far* to the right when they go
> to sleep, they will again be there on placement.

Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

	vl' = vl

But in fact it is scaled on placement:

	vl' = vl * W/(W + w)

Is this intended? And to illustrate my understanding of strategy #1:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07f555857698..a24ef8b297ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5131,7 +5131,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  	 *
  	 * EEVDF: placement strategy #1 / #2
  	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) {
  		struct sched_entity *curr = cfs_rq->curr;
  		unsigned long load;
  
@@ -5150,7 +5150,10 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  		 * To avoid the 'w_i' term all over the place, we only track
  		 * the virtual lag:
  		 *
-		 *   vl_i = V - v_i <=> v_i = V - vl_i
+		 *   vl_i = V' - v_i <=> v_i = V' - vl_i
+		 *
+		 * Where V' is the new weighted average after placing this
+		 * entity, and v_i is its newly assigned vruntime.
  		 *
  		 * And we take V to be the weighted average of all v:
  		 *
@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  		 * vl_i is given by:
  		 *
  		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
-		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
-		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
-		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
-		 *      = V - w_i*vl_i / (W + w_i)
-		 *
-		 * And the actual lag after adding an entity with vl_i is:
-		 *
-		 *   vl'_i = V' - v_i
-		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
-		 *         = vl_i - w_i*vl_i / (W + w_i)
-		 *
-		 * Which is strictly less than vl_i. So in order to preserve lag
-		 * we should inflate the lag before placement such that the
-		 * effective lag after placement comes out right.
-		 *
-		 * As such, invert the above relation for vl'_i to get the vl_i
-		 * we need to use such that the lag after placement is the lag
-		 * we computed before dequeue.
+		 *      = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+		 *      = V - w_i*vl_i / W
  		 *
-		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
-		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
-		 *
-		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
-		 *                   = W*vl_i
-		 *
-		 *   vl_i = (W + w_i)*vl'_i / W
  		 */
  		load = cfs_rq->avg_load;
  		if (curr && curr->on_rq)
  			load += scale_load_down(curr->load.weight);
-
-		lag *= load + scale_load_down(se->load.weight);
  		if (WARN_ON_ONCE(!load))
  			load = 1;
-		lag = div_s64(lag, load);
+
+		vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
  	}
  
  	se->vruntime = vruntime - lag;

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-16 18:58 EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement) Tobias Huschle
  2023-11-17  9:23 ` Peter Zijlstra
@ 2023-11-18  7:33 ` Abel Wu
  2023-11-18 15:29   ` Honglei Wang
  2023-11-19 13:29 ` Bagas Sanjaya
  2 siblings, 1 reply; 58+ messages in thread
From: Abel Wu @ 2023-11-18  7:33 UTC (permalink / raw)
  To: Tobias Huschle, Linux Kernel, kvm, virtualization, netdev
  Cc: Peterz, mst, jasowang

On 11/17/23 2:58 AM, Tobias Huschle Wrote:
> #################### TRACE EXCERPT ####################
> The sched_place trace event was added to the end of the place_entity function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr  -> cfs_rq nr_running
> ---
>      CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284 cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14 nr=0
>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14 nr=0
>      CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14 nr=0
>      CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14 nr=0

As the following 2 lines indicate, vhost-2920 is on_rq and can be
picked as next, thus its cfs_rq must have at least one entity.

While the above 4 lines show nr=0, so the "comm= pid=0" task(s) can't
be in the same cgroup as vhost-2920.

Say vhost is in cgroupA, and "comm= pid=0" task with sev=86640641980
is in cgroupB ...

>      CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait: comm=vhost-2920 pid=2941 delay=9958 [ns]
>      CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
>     vhost-2920-2941    [014] D....   576.161439: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>     vhost-2920-2941    [014] d....   576.161446: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>     vhost-2920-2941    [014] d....   576.161447: sched_place: comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0 avg=86642125805 min=86642125805 cpu=14 nr=1
> --> places kworker 6525 on cpu 14 with vruntime 86642125805
> -->  which is far larger than vhost vruntime of  4242563284

Here nr=1 means there is another entity in the same cfs_rq as the
newly woken kworker, but which? According to the vruntime, I would
assume kworker is in cgroupB.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-18  7:33 ` Abel Wu
@ 2023-11-18 15:29   ` Honglei Wang
  0 siblings, 0 replies; 58+ messages in thread
From: Honglei Wang @ 2023-11-18 15:29 UTC (permalink / raw)
  To: Abel Wu, Tobias Huschle, Linux Kernel, kvm, virtualization, netdev
  Cc: Peterz, mst, jasowang



On 2023/11/18 15:33, Abel Wu wrote:
> On 11/17/23 2:58 AM, Tobias Huschle Wrote:
>> #################### TRACE EXCERPT ####################
>> The sched_place trace event was added to the end of the place_entity 
>> function and outputs:
>> sev -> sched_entity vruntime
>> sed -> sched_entity deadline
>> sel -> sched_entity vlag
>> avg -> cfs_rq avg_vruntime
>> min -> cfs_rq min_vruntime
>> cpu -> cpu of cfs_rq
>> nr  -> cfs_rq nr_running
>> ---
>>      CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task: 
>> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
>> --> migrates task from cpu 15 to 14
>>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: 
>> comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0 
>> avg=4242563284 min=4242563284 cpu=14 nr=0
>> --> places vhost 2920 on CPU 14 with vruntime 4242563284
>>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= 
>> pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593 
>> min=16329848593 cpu=14 nr=0
>>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= 
>> pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157 
>> min=42560661157 cpu=14 nr=0
>>      CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= 
>> pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372 
>> min=53846627372 cpu=14 nr=0
>>      CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= 
>> pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980 
>> min=86640641980 cpu=14 nr=0
> 
> As the following 2 lines indicates that vhost-2920 is on_rq so can be
> picked as next, thus its cfs_rq must have at least one entity.
> 
> While the above 4 lines shows nr=0, so the "comm= pid=0" task(s) can't
> be in the same cgroup with vhost-2920.
> 
> Say vhost is in cgroupA, and "comm= pid=0" task with sev=86640641980
> is in cgroupB ...
> 
This looks like hierarchical enqueue at work. The temporary trace can get
the comm and pid of vhost-2920, but fails for the other 4. I think the
reason is that they are just sched_entities, not tasks. They seem to come
from the for_each_sched_entity(se) walk when enqueueing vhost-2920, and
the last one with cfs_rq vruntime=86640641980 might be the root cgroup
level, which is the same level the kworkers are on.
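
To make that concrete, here is a tiny self-contained userspace model of the
upward walk (not the kernel code; names, nesting depth and initial states are
made up). Enqueueing a task that sits in a nested cgroup also enqueues -- and
therefore places -- every parent group entity that is not yet on its
runqueue, and group entities have no task attached, which is why the trace
prints "comm= pid=0" for them.

#include <stdio.h>
#include <stdbool.h>

struct entity {
	const char *name;	/* "" models a group entity (no task, so pid=0) */
	bool on_rq;
	struct entity *parent;
};

int main(void)
{
	struct entity root  = { "",           false, NULL  };
	struct entity cg_a  = { "",           false, &root };
	struct entity cg_b  = { "",           false, &cg_a };
	struct entity cg_c  = { "",           false, &cg_b };
	struct entity vhost = { "vhost-2920", false, &cg_c };

	/* analogous to for_each_sched_entity(se): se = se->parent */
	for (struct entity *se = &vhost; se; se = se->parent) {
		if (se->on_rq)
			break;		/* rest of the hierarchy is already queued */
		se->on_rq = true;
		printf("sched_place: comm=%s\n", se->name);	/* 1 task + 4 group entities */
	}
	return 0;
}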

So, judging just from this tiny part of the trace log, the difference should
not be on the order of thousands of ms. Actually, it might be only
86642125805-86640641980 = 1.5 ms.

Correct me if there is anything wrong.

Thanks,
Honglei
>>      CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait: 
>> comm=vhost-2920 pid=2941 delay=9958 [ns]
>>      CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: 
>> prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> 
>> next_comm=vhost-2920 next_pid=2941 next_prio=120
>>     vhost-2920-2941    [014] D....   576.161439: sched_waking: 
>> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>>     vhost-2920-2941    [014] d....   576.161446: sched_waking: 
>> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>>     vhost-2920-2941    [014] d....   576.161447: sched_place: 
>> comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0 
>> avg=86642125805 min=86642125805 cpu=14 nr=1
>> --> places kworker 6525 on cpu 14 with vruntime 86642125805
>> -->  which is far larger than vhost vruntime of  4242563284
> 
> Here nr=1 means there is another entity in the same cfs_rq with the
> newly woken kworker, but which? According to the vruntime, I would
> assume kworker is in cgroupB.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-16 18:58 EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement) Tobias Huschle
  2023-11-17  9:23 ` Peter Zijlstra
  2023-11-18  7:33 ` Abel Wu
@ 2023-11-19 13:29 ` Bagas Sanjaya
  2 siblings, 0 replies; 58+ messages in thread
From: Bagas Sanjaya @ 2023-11-19 13:29 UTC (permalink / raw)
  To: Tobias Huschle, Linux Kernel Mailing List, Linux KVM,
	Linux Virtualization, Linux Networking
  Cc: Peter Zijlstra, Ingo Molnar, Abel Wu, Honglei Wang, mst, jasowang


On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
> Hi,
> 
> when testing the EEVDF scheduler we stumbled upon a performance regression
> in a uperf scenario and would like to
> kindly ask for feedback on whether we are going into the right direction
> with our analysis so far.
> 
> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.
> 
> Bisecting led us to the following commit which appears to introduce the
> regression:
> 86bfbb7ce4f6 sched/fair: Add lag based placement
> 
> We then compared the last good commit we identified with a recent level of
> the devel branch.
> The issue still persists on 6.7 rc1 although there is some improvement (down
> from 62% regression to 49%)
> 
> All analysis described further are based on a 6.6 rc7 kernel.
> 
> We sampled perf data to get an idea on what is going wrong and ended up
> seeing an dramatic increase in the maximum
> wait times from 3ms up to 366ms. See section WAIT DELAYS below for more
> details.
> 
> We then collected tracing data to get a better insight into what is going
> on.
> The trace excerpt in section TRACE EXCERPT shows one example (of multiple
> per test run) of the problematic scenario where
> a kworker(pid=6525) has to wait for 39,718 ms.
> 
> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)
> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
> 
> So far this behavior is not different to what we see on pre-EEVDF kernels.
> 
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.
> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.
> 
> 
> We observed the large difference in vruntime between kworker and vhost in
> the same magnitude on
> a kernel built based on the parent of the commit mentioned above.
> With EEVDF, the kworker is doomed to wait until the vhost either catches up
> on vruntime (which would take 86 seconds)
> or the vhost is migrated off of the CPU.
> 
> We found some options which sound plausible but we are not sure if they are
> valid or not:
> 
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
>     cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.
> 3. An assumption in the vhost code which causes vhost to rely on being
> scheduled off in time to allow the kworker to proceed.
> 
> We also stumbled upon the following mailing thread:
> https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> That conversation, and the patches derived from it lead to the assumption
> that the wake up path might be adjustable in a way
> that this case in particular can be addressed.
> At the same time, the vast difference in vruntimes is concerning since, at
> least for some time frame, both processes are on the runqueue.
> 
> We would be glad to hear some feedback on which paths to pursue and which
> might just be a dead end in the first place.
> 
> 
> #################### TRACE EXCERPT ####################
> The sched_place trace event was added to the end of the place_entity
> function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr  -> cfs_rq nr_running
> ---
>     CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=vhost-2920
> pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284
> cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0
> sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0
> sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0
> sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0
> sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=9958 [ns]
>     CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: prev_comm=CPU
> 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920
> next_pid=2941 next_prio=120
>    vhost-2920-2941    [014] D....   576.161439: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] d....   576.161446: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] d....   576.161447: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0
> avg=86642125805 min=86642125805 cpu=14 nr=1
> --> places kworker 6525 on cpu 14 with vruntime 86642125805
> -->  which is far larger than vhost vruntime of  4242563284
>    vhost-2920-2941    [014] d....   576.161447: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=10143757 [ns]
>    vhost-2920-2941    [014] dN...   576.161447: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] dN...   576.161448: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
> --> vhost 2920 finishes after 13884 ns of runtime
>    vhost-2920-2941    [014] dN...   576.161448: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=0 [ns]
>    vhost-2920-2941    [014] d....   576.161448: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
> --> switch to kworker
>  kworker/14:0-6525    [014] d....   576.161449: sched_waking: comm=CPU 2/KVM
> pid=2949 prio=120 target_cpu=007
>  kworker/14:0-6525    [014] d....   576.161450: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
> --> kworker finshes after 3714 ns of runtime
>  kworker/14:0-6525    [014] d....   576.161450: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=3714 [ns]
>  kworker/14:0-6525    [014] d....   576.161451: sched_switch:
> prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==>
> next_comm=vhost-2920 next_pid=2941 next_prio=120
> --> switch back to vhost
>    vhost-2920-2941    [014] d....   576.161478: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] d....   576.161478: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150
> avg=86642188144 min=86642188144 cpu=14 nr=1
> --> kworker placed again on cpu 14 with vruntime 86642191859, the problem
> occurs only if lag <= 0, having lag=0 does not always hit the problem though
>    vhost-2920-2941    [014] d....   576.161478: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=27943 [ns]
>    vhost-2920-2941    [014] d....   576.161479: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] D....   576.161511: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.161512: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] D....   576.161516: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.161773: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.161775: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] D....   576.162103: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.162105: sched_waking:
> comm=vhost-2286 pid=2307 prio=120 target_cpu=021
>    vhost-2920-2941    [014] D....   576.162326: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
>    vhost-2920-2941    [014] D....   576.162437: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.162767: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
>    vhost-2920-2941    [014] d.h..   576.171155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
>    vhost-2920-2941    [014] d.h..   576.181155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
>    vhost-2920-2941    [014] d.h..   576.191154: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
>    vhost-2920-2941    [014] d.h..   576.201155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
> --> vhost gets rescheduled multiple times because its vruntime is
> significantly smaller than the vruntime of the kworker
>    vhost-2920-2941    [014] dNh..   576.201176: sched_wakeup:
> comm=migration/14 pid=85 prio=0 target_cpu=014
>    vhost-2920-2941    [014] dN...   576.201191: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
>    vhost-2920-2941    [014] d....   576.201192: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=migration/14 next_pid=85 next_prio=0
>  migration/14-85      [014] d..1.   576.201194: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
> --> vhost gets migrated off of cpu 14
>  migration/14-85      [014] d..1.   576.201194: sched_place: comm=vhost-2920
> pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923
> cpu=15 nr=0
>  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0
> sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15
> nr=0
>  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0
> sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15
> nr=0
>  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0
> sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15
> nr=0
>  migration/14-85      [014] d..1.   576.201196: sched_place: comm= pid=0
> sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523
> min=70190876523 cpu=15 nr=0
>  migration/14-85      [014] d....   576.201198: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=39718472 [ns]
>  migration/14-85      [014] d....   576.201198: sched_switch:
> prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
>  --> only now, kworker is eligible to run again, after a delay of 39718472
> ns
>  kworker/14:0-6525    [014] d....   576.201200: sched_waking: comm=CPU 0/KVM
> pid=2947 prio=120 target_cpu=012
>  kworker/14:0-6525    [014] d....   576.201290: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]
> 
> #################### WAIT DELAYS - PERF LATENCY ####################
> last good commit --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Avg delay ms    | Max
> delay ms    | Max delay start           | Max delay end          |
> -------------------------------------------------------------------------------------------------------------------------------------------
>   CPU 2/KVM:(2)         |   5399.650 ms |   108698 | avg:   0.003 ms | max:
> 3.077 ms | max start:   544.090322 s | max end:   544.093399 s
>   CPU 7/KVM:(2)         |   5111.132 ms |    69632 | avg:   0.003 ms | max:
> 2.980 ms | max start:   544.690994 s | max end:   544.693974 s
>   kworker/22:3-ev:723   |    342.944 ms |    63417 | avg:   0.005 ms | max:
> 1.880 ms | max start:   545.235430 s | max end:   545.237310 s
>   CPU 0/KVM:(2)         |   8171.431 ms |   433099 | avg:   0.003 ms | max:
> 1.004 ms | max start:   547.970344 s | max end:   547.971348 s
>   CPU 1/KVM:(2)         |   5486.260 ms |   258702 | avg:   0.003 ms | max:
> 1.002 ms | max start:   548.782514 s | max end:   548.783516 s
>   CPU 5/KVM:(2)         |   4766.143 ms |    65727 | avg:   0.003 ms | max:
> 0.997 ms | max start:   545.313610 s | max end:   545.314607 s
>   vhost-2268:(6)        |  13206.503 ms |   315030 | avg:   0.003 ms | max:
> 0.989 ms | max start:   550.887761 s | max end:   550.888749 s
>   vhost-2892:(6)        |  14467.268 ms |   214005 | avg:   0.003 ms | max:
> 0.981 ms | max start:   545.213819 s | max end:   545.214800 s
>   CPU 3/KVM:(2)         |   5538.908 ms |    85105 | avg:   0.003 ms | max:
> 0.883 ms | max start:   547.138139 s | max end:   547.139023 s
>   CPU 6/KVM:(2)         |   5289.827 ms |    72301 | avg:   0.003 ms | max:
> 0.836 ms | max start:   551.094590 s | max end:   551.095425 s
> 
> 6.6 rc7 --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Avg delay ms    | Max
> delay ms    | Max delay start           | Max delay end          |
> -------------------------------------------------------------------------------------------------------------------------------------------
>   kworker/19:2-ev:1071  |     69.482 ms |    12700 | avg:   0.050 ms | max:
> 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
>   kworker/13:1-ev:184   |     78.048 ms |    14645 | avg:   0.067 ms | max:
> 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
>   kworker/12:1-ev:46148 |    138.488 ms |    26660 | avg:   0.021 ms | max:
> 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
>   kworker/16:2-ev:33076 |    149.175 ms |    29491 | avg:   0.026 ms | max:
> 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
>   CPU 3/KVM:(2)         |   1934.714 ms |    41896 | avg:   0.007 ms | max:
> 92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
>   kworker/7:2-eve:17001 |     68.164 ms |    11820 | avg:   0.045 ms | max:
> 69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
>   kworker/17:1-ev:46510 |     68.804 ms |    13328 | avg:   0.037 ms | max:
> 67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
>   kworker/21:1-ev:45782 |     68.906 ms |    13215 | avg:   0.021 ms | max:
> 59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
>   ksoftirqd/17:101      |      0.041 ms |        2 | avg:  25.028 ms | max:
> 50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s
> 
> #################### TEST SUMMARY ####################
>  Setup description:
> - single KVM host with 2 identical guests
> - guests are connected virtually via Open vSwitch
> - guests run uperf streaming read workload with 50 parallel connections
> - one guests acts as uperf client, the other one as uperf server
> 
> Regression:
> kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.7.0-rc1: 41 Gb/s
> 
> KVM host:
> - 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
> - 64 GiB memory
> - FEDORA 38
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
> 
> KVM guests:
> - 8 vCPUs
> - 8 GiB memory
> - RHEL 9.2
> - kernel: 5.14.0-162.6.1.el9_1.s390x
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
> 
> Open vSwitch:
> - Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
> - Open vSwitch ports attached to guests via virtio-net
> - each guest has 4 vhost-queues
> 
> Domain xml snippet for Open vSwitch port:
> <interface type="bridge" dev="OVS">
>   <source bridge="vswitch0"/>
>   <mac address="02:bb:97:28:02:02"/>
>   <virtualport type="openvswitch"/>
>   <model type="virtio"/>
>   <target dev="vport1"/>
>   <driver name="vhost" queues="4"/>
>   <address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
> </interface>
> 
> Benchmark: uperf
> - workload: str-readx30k, 50 active parallel connections
> - uperf server permanently sends data in 30720-byte chunks
> - uperf client receives and acknowledges this data
> - Server: uperf -s
> - Client: uperf -a -i 30 -m uperf.xml
> 
> uperf.xml:
> <?xml version="1.0"?>
> <profile name="strburst">
>   <group nprocs="50">
>     <transaction iterations="1">
>       <flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp
> "/>
>     </transaction>
>     <transaction duration="300">
>       <flowop type="read" options="count=640 size=30k"/>
>     </transaction>
>     <transaction iterations="1">
>       <flowop type="disconnect" />
>     </transaction>
>   </group>
> </profile>

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 86bfbb7ce4f67a

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-18  5:14   ` Abel Wu
@ 2023-11-20 10:56     ` Peter Zijlstra
  2023-11-20 12:06       ` Abel Wu
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2023-11-20 10:56 UTC (permalink / raw)
  To: Abel Wu
  Cc: Tobias Huschle, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:

> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
> Given that the weight doesn't change, we have:
> 
> 	vl' = vl
> 
> But in fact it is scaled on placement:
> 
> 	vl' = vl * W/(W + w)

(W+w)/W

> 
> Is this intended? 

The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.

Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.
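
As a rough standalone illustration of that dissipation (made-up numbers, not
kernel code): without the inflation step, every placement scales the carried
lag by W/(W + w), so a task that sleeps and wakes a few times sees its lag
shrink geometrically instead of being preserved.

#include <stdio.h>

int main(void)
{
	double W = 3072.0;		/* weight already on the runqueue */
	double w = 1024.0;		/* weight of the entity being placed */
	double lag = -1000000.0;	/* virtual lag carried over from dequeue */

	for (int i = 1; i <= 5; i++) {
		lag *= W / (W + w);	/* effective lag after an unscaled placement */
		printf("after %d enqueue/dequeue cycles: lag = %.0f\n", i, lag);
	}
	return 0;
}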

> And to illustrate my understanding of strategy #1:

> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		 * vl_i is given by:
>  		 *
>  		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> -		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
> -		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> -		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
> -		 *      = V - w_i*vl_i / (W + w_i)
> -		 *
> -		 * And the actual lag after adding an entity with vl_i is:
> -		 *
> -		 *   vl'_i = V' - v_i
> -		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> -		 *         = vl_i - w_i*vl_i / (W + w_i)
> -		 *
> -		 * Which is strictly less than vl_i. So in order to preserve lag
> -		 * we should inflate the lag before placement such that the
> -		 * effective lag after placement comes out right.
> -		 *
> -		 * As such, invert the above relation for vl'_i to get the vl_i
> -		 * we need to use such that the lag after placement is the lag
> -		 * we computed before dequeue.
> +		 *      = (W*V + w_i*(V' - vl_i)) / (W + w_i)
> +		 *      = V - w_i*vl_i / W
>  		 *
> -		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> -		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> -		 *
> -		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> -		 *                   = W*vl_i
> -		 *
> -		 *   vl_i = (W + w_i)*vl'_i / W
>  		 */
>  		load = cfs_rq->avg_load;
>  		if (curr && curr->on_rq)
>  			load += scale_load_down(curr->load.weight);
> -
> -		lag *= load + scale_load_down(se->load.weight);
>  		if (WARN_ON_ONCE(!load))
>  			load = 1;
> -		lag = div_s64(lag, load);
> +
> +		vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
>  	}
>  	se->vruntime = vruntime - lag;


So you're proposing we do:

	v = V - (lag * w) / (W + w) - lag

?

That can be written like:

	v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
	  = V - (lag * (W+w) + lag * w) / (W+w)
	  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement: 

l' = V' - v = (V*W + v*w) / (W+w) - v
            = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
	    = (V*W + v*w -v*W - v*w) / (W+w)
	    = (V*W - v*W) / (W+w)
	    = W*(V-v) / (W+w)
	    = W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
   = W/(W+w) * (V - V + (W+w)/W * l)
   = W/(W+w) * (W+w)/W * l
   = l

So by scaling, we've preserved lag across placement.

That make sense?
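
FWIW, the same thing can be checked numerically. A throwaway sketch with
arbitrary values (not kernel code) that places an entity at
v = V - (W+w)/W * l and then measures the lag against the recomputed
weighted average:

#include <stdio.h>

int main(void)
{
	double W = 3072.0, V = 5000000.0;	/* pre-placement sum-weight and avg_vruntime */
	double w = 1024.0, l = 250000.0;	/* weight and lag of the entity being placed */

	double v  = V - (W + w) / W * l;	/* scaled placement */
	double V2 = (V * W + v * w) / (W + w);	/* new weighted average */

	printf("requested lag %.0f, lag after placement %.0f\n", l, V2 - v);
	return 0;
}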

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-20 10:56     ` Peter Zijlstra
@ 2023-11-20 12:06       ` Abel Wu
  0 siblings, 0 replies; 58+ messages in thread
From: Abel Wu @ 2023-11-20 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tobias Huschle, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On 11/20/23 6:56 PM, Peter Zijlstra Wrote:
> On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:
> 
>> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
>> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
>> Given that the weight doesn't change, we have:
>>
>> 	vl' = vl
>>
>> But in fact it is scaled on placement:
>>
>> 	vl' = vl * W/(W + w)
> 
> (W+w)/W

Ah, right. I misunderstood (again) the comment which says:

	vl_i = (W + w_i)*vl'_i / W

So the current implementation is:

	v' = V - vl'

and what I was proposing is:

	v' = V' - vl

and they are equal in fact.

> 
>>
>> Is this intended?
> 
> The scaling, yes that's intended and the comment explains why. So now
> you have me confused too :-)
> 
> Specifically, I want the lag after placement to be equal to the lag we
> come in with. Since placement will affect avg_vruntime (adding one
> element to the average changes the average etc..) the placement also
> affects the lag as measured after placement.

Yes. You did the math in an iterative fashion, while mine works from the
final state:

	v' = V' - vlag
	V' = (WV + wv') / (W + w)

which gives:

	V' = V - w * vlag / W

> 
> Or rather, if you enqueue and dequeue, I want the lag to be preserved.
> If you do not take placement into consideration, lag will dissipate real
> quick.
> 
>> And to illustrate my understanding of strategy #1:
> 
>> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>   		 * vl_i is given by:
>>   		 *
>>   		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
>> -		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
>> -		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
>> -		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
>> -		 *      = V - w_i*vl_i / (W + w_i)
>> -		 *
>> -		 * And the actual lag after adding an entity with vl_i is:
>> -		 *
>> -		 *   vl'_i = V' - v_i
>> -		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
>> -		 *         = vl_i - w_i*vl_i / (W + w_i)
>> -		 *
>> -		 * Which is strictly less than vl_i. So in order to preserve lag
>> -		 * we should inflate the lag before placement such that the
>> -		 * effective lag after placement comes out right.
>> -		 *
>> -		 * As such, invert the above relation for vl'_i to get the vl_i
>> -		 * we need to use such that the lag after placement is the lag
>> -		 * we computed before dequeue.
>> +		 *      = (W*V + w_i*(V' - vl_i)) / (W + w_i)
>> +		 *      = V - w_i*vl_i / W
>>   		 *
>> -		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
>> -		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
>> -		 *
>> -		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
>> -		 *                   = W*vl_i
>> -		 *
>> -		 *   vl_i = (W + w_i)*vl'_i / W
>>   		 */
>>   		load = cfs_rq->avg_load;
>>   		if (curr && curr->on_rq)
>>   			load += scale_load_down(curr->load.weight);
>> -
>> -		lag *= load + scale_load_down(se->load.weight);
>>   		if (WARN_ON_ONCE(!load))
>>   			load = 1;
>> -		lag = div_s64(lag, load);
>> +
>> +		vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
>>   	}
>>   	se->vruntime = vruntime - lag;
> 
> 
> So you're proposing we do:
> 
> 	v = V - (lag * w) / (W + w) - lag

What I'm proposing is:

	V' = V - w * vlag / W

so we have:

	v' = V' - vlag
	   = V - vlag * w/W - vlag
	   = V - vlag * (W + w)/W

which is exactly the same as the current implementation.

> 
> ?
> 
> That can be written like:
> 
> 	v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
> 	  = V - (lag * (W+w) + lag * w) / (W+w)
> 	  = V - (lag * (W+2w)) / (W+w)
> 
> And that turns into a mess AFAICT.
> 
> 
> Let me repeat my earlier argument. Suppose v,w,l are the new element.
> V,W are the old avg_vruntime and sum-weight.
> 
> Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).
> 
> The new lag, after placement:
> 
> l' = V' - v = (V*W + v*w) / (W+w) - v
>              = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
> 	    = (V*W + v*w -v*W - v*w) / (W+w)
> 	    = (V*W - v*W) / (W+w)
> 	    = W*(V-v) / (W+w)
> 	    = W/(W+w) * (V-v)
> 
> Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:
> 
> l' = W/(W+w) * (V - (V - (W+w)/W * l))
>     = W/(W+w) * (V - V + (W+w)/W * l)
>     = W/(W+w) * (W+w)/W * l
>     = l
> 
> So by scaling, we've preserved lag across placement.
> 
> That make sense?

Yes, I think I won't misunderstand again for the 3rd time :)

Thanks!
	Abel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-17 13:07       ` Abel Wu
@ 2023-11-21 13:17         ` Tobias Huschle
  2023-11-22 10:00           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Tobias Huschle @ 2023-11-21 13:17 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Fri, Nov 17, 2023 at 09:07:55PM +0800, Abel Wu wrote:
> On 11/17/23 8:37 PM, Peter Zijlstra Wrote:

[...]

> > Ah, so if this is a cgroup issue, it might be worth trying this patch
> > that we have in tip/sched/urgent.
> 
> And please also apply this fix:
> https://lore.kernel.org/all/20231117080106.12890-1-s921975628@gmail.com/
> 

We applied both suggested patch options and ran the test again, namely

sched/eevdf: Fix vruntime adjustment on reweight
sched/fair: Update min_vruntime for reweight_entity() correctly

and

sched/eevdf: Delay dequeue

Unfortunately, both variants do NOT fix the problem.
The regression remains unchanged.


I will continue getting myself familiar with how cgroups are scheduled to dig 
deeper here. If there are any other ideas, I'd be happy to use them as a 
starting point for further analysis.

Would additional traces still be of interest? If so, I would be glad to
provide them.

[...]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-21 13:17         ` Tobias Huschle
@ 2023-11-22 10:00           ` Peter Zijlstra
  2023-11-27 13:56             ` Tobias Huschle
       [not found]             ` <6564a012.c80a0220.adb78.f0e4SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2023-11-22 10:00 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Abel Wu, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

> We applied both suggested patch options and ran the test again, so 
> 
> sched/eevdf: Fix vruntime adjustment on reweight
> sched/fair: Update min_vruntime for reweight_entity() correctly
> 
> and
> 
> sched/eevdf: Delay dequeue
> 
> Unfortunately, both variants do NOT fix the problem.
> The regression remains unchanged.

Thanks for testing.

> I will continue getting myself familiar with how cgroups are scheduled to dig 
> deeper here. If there are any other ideas, I'd be happy to use them as a 
> starting point for further analysis.
> 
> Would additional traces still be of interest? If so, I would be glad to
> provide them.

So, since it got bisected to the placement logic but is a cgroup-related
issue, I was thinking that 'Delay dequeue' might not cut it; that only
works for tasks, not the internal entities.

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situation.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers

For tasks that have received excess service (negative lag) allow them to
gain parity (zero lag) by sleeping.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     | 36 ++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  6 ++++++
 2 files changed, 42 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..b975e4b07a68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5110,6 +5110,33 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
+static inline u64
+entity_vlag_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	u64 now, vdelta;
+	s64 delta;
+
+	if (!(flags & ENQUEUE_WAKEUP))
+		return se->vlag;
+
+	if (flags & ENQUEUE_MIGRATED)
+		return 0;
+
+	now = rq_clock_task(rq_of(cfs_rq));
+	delta = now - se->exec_start;
+	if (delta < 0)
+		return se->vlag;
+
+	if (sched_feat(GENTLE_SLEEPER))
+		delta /= 2;
+
+	vdelta = __calc_delta(delta, NICE_0_LOAD, &cfs_rq->load);
+	if (vdelta < -se->vlag)
+		return se->vlag + vdelta;
+
+	return 0;
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -5133,6 +5160,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		lag = se->vlag;
 
+		/*
+		 * Allow tasks that have received too much service (negative
+		 * lag) to (re)gain parity (zero lag) by sleeping for the
+		 * equivalent duration. This ensures they will be readily
+		 * eligible.
+		 */
+		if (sched_feat(PLACE_SLEEPER) && lag < 0)
+			lag = entity_vlag_sleeper(cfs_rq, se, flags);
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84de430..722282d3ed07 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,12 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Let sleepers earn back lag, but not more than 0-lag. GENTLE_SLEEPERS earn at
+ * half the speed.
+ */
+SCHED_FEAT(PLACE_SLEEPER, true)
+SCHED_FEAT(GENTLE_SLEEPER, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-22 10:00           ` Peter Zijlstra
@ 2023-11-27 13:56             ` Tobias Huschle
       [not found]             ` <6564a012.c80a0220.adb78.f0e4SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-11-27 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Abel Wu, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
> 
> The below should also work for internal entities, but last time I poked
> around with it I had some regressions elsewhere -- you know, fix one,
> wreck another type of situation.
> 
> But still, could you please give it a go -- it applies cleanly to linus'
> master and -rc2.
> 
> ---
> Subject: sched/eevdf: Revenge of the Sith^WSleepers
> 

Tried the patch, it does not help, unfortunately.

It might also be possible that the long-running vhost is stuck on something
during those phases where it just runs for a while. This might have been a
risk all along; EEVDF might have just uncovered an unfortunate sequence of
events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the 
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:       
                    vruntime      deadline
   cgroup        56117619190   57650477291 -> depth 0
   kworker       56117624131   56120619190
  This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller 
  by a factor of 20, but depth 0 seems to be running with values of the 
  correct magnitude.
- cgroup at depth 0 has zero lag, with higher depth, there are large lag 
  values (as observed 606.338267 onwards)

Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:       
                    vruntime      deadline
   cgroup        56117619190   57650477291 -> depth 0, last known value
   kworker       56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here 
  I would need to enhance the trace to also print the vruntimes of the 
  parent sched_entities to see the progress of their vruntime/deadline/lag 
  values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker 
  over the cgroup as its deadline is smaller.
  But, the kworker has negative lag, which might cause EEVDF to not pick 
  the kworker.
  The cgroup at depth 0 has no lag; all deeper layers have a significantly
  positive lag (last known values, might have changed in the meantime).
  At this point I would see the option that the vhost task is stuck
  somewhere or EEVDF just does not see the kworker as an eligible option
  (a toy illustration of the eligibility check follows after this list).

- Once the vhost is migrated off the cpu, the update_entity_lag function
  works with the following values at 606.467022: sched_update
  For the cgroup at depth 0
  - vruntime = 57104166665 --> this is in line with the amount of new timeslices
                               vhost got assigned while the kworker was waiting
  - vlag     =   -62439022 --> the scheduler knows that the cgroup was 
                               overconsuming, but no runtime for the kworker
  For the cfs_rq we have
  - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
  - avg_vruntime = 161750065796 --> this is rather large in comparison, but I 
                                    might access this value at a bad time
  - nr_running   =            2 --> at this point, both, cgroup and kworker are 
                                    still on the queue, with the cgroup being 
                                    in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not 
    propagated downwards. This might be intended though.

- At 606.479979: sched_place, the cgroup hosting the vhost is migrated back
  onto cpu 13 with a lag of -166821875; it gets scheduled right away as
  there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
  with no lag, and gets scheduled almost immediately, with a wait
  time of 1255 ns.

It shall be noted, that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.
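
As an aside, here is a toy user-space model of the eligibility check as I
understand it (this is NOT the kernel code -- the kernel weights the average
by load, see avg_vruntime()/entity_eligible() -- and the numbers are made up).
It just illustrates how an entity with the earliest deadline can still lose
the pick when its vruntime is already ahead of the queue average, i.e. when
its lag is negative:

#include <stdio.h>

struct entity {
	const char *name;
	long long vruntime;
	long long deadline;
};

/* toy EEVDF pick: among entities with vruntime <= avg (lag >= 0),
 * take the earliest deadline; ineligible entities are skipped */
static const struct entity *toy_pick(const struct entity *e, int n, long long avg)
{
	const struct entity *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (e[i].vruntime > avg)	/* negative lag -> not eligible */
			continue;
		if (!best || e[i].deadline < best->deadline)
			best = &e[i];
	}
	return best;
}

int main(void)
{
	/* hypothetical values, same shape as the situation described above */
	struct entity e[] = {
		{ "kworker", 105, 120 },	/* earliest deadline, but vruntime > avg */
		{ "cgroup",  100, 200 },	/* later deadline, but eligible */
	};

	printf("picked: %s\n", toy_pick(e, 2, 102)->name);	/* -> cgroup */
	return 0;
}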

######################### full trace #########################

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> during __pick_eevdf, prints values for best and the first node loop variable, second loop is never executed

sched_place/sched_update: sev=se->vruntime,sed=se->deadline,sel=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
--> at the end of place_entity and update_entity_lag

--> the chunks of 5 entries for these new events represent the 5 levels of the cgroup which hosts the vhost

    vhost-2931-2953    [013] d....   606.338262: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=90133345 [ns]
    vhost-2931-2953    [013] d....   606.338262: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
    vhost-2931-2953    [013] dN...   606.338263: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] dN...   606.338263: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
    vhost-2931-2953    [013] dN...   606.338263: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=17910 [ns] vruntime=2099190650 [ns] deadline=2102172740 [ns] lag=2102172740
    vhost-2931-2953    [013] dN...   606.338264: sched_stat_wait: comm=kworker/13:1 pid=168 delay=0 [ns]
    vhost-2931-2953    [013] d....   606.338264: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
--> kworker allowed to execute
  kworker/13:1-168     [013] d....   606.338266: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
  kworker/13:1-168     [013] d....   606.338267: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=4941 [ns] vruntime=56117624131 [ns] deadline=56120619190 [ns] lag=56120619190
--> runtime of 4941 ns
  kworker/13:1-168     [013] d....   606.338267: sched_update: comm=kworker/13:1 pid=168 sev=56117624131 sed=56120619190 sel=-725 avg=0 min=56117619190 cpu=13 nr=2 lag=-725 lim=10000000
  kworker/13:1-168     [013] d....   606.338267: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117619190 d=57650477291 l=0 md=57650477291 dp=0
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        56117619190   57650477291
                        kworker       56117624131   56120619190
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=29822481776 d=29834647752 l=29834647752 md=29834647752 dp=1
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=21909608438 d=21919458955 l=21919458955 md=21919458955 dp=2
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=11306038504 d=11312426915 l=11312426915 md=11312426915 dp=3
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2099190650 d=2102172740 l=2102172740 md=2102172740 dp=4
  kworker/13:1-168     [013] d....   606.338268: sched_stat_wait: comm=vhost-2931 pid=2953 delay=4941 [ns]
  kworker/13:1-168     [013] d....   606.338269: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
    vhost-2931-2953    [013] d....   606.338311: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] d....   606.338312: sched_place: comm=kworker/13:1 pid=168 sev=56117885776 sed=56120885776 sel=-725 avg=0 min=56117880833 cpu=13 nr=1 vru=56117880833 lag=-725
--> kworker gets placed again
    vhost-2931-2953    [013] d....   606.338312: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=44970 [ns]
    vhost-2931-2953    [013] d....   606.338313: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
--> kworker set to runnable, but vhost keeps on executing
    vhost-2931-2953    [013] d.h..   606.346964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=8697702 [ns] vruntime=2107888352 [ns] deadline=2110888352 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.356964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999583 [ns] vruntime=2117887935 [ns] deadline=2120887935 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.366964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000089 [ns] vruntime=2127888024 [ns] deadline=2130888024 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.376964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999716 [ns] vruntime=2137887740 [ns] deadline=2140887740 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.386964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000179 [ns] vruntime=2147887919 [ns] deadline=2150887919 [ns] lag=2102172740
    vhost-2931-2953    [013] D....   606.392250: sched_waking: comm=vhost-2306 pid=2324 prio=120 target_cpu=018
    vhost-2931-2953    [013] D....   606.392388: sched_waking: comm=vhost-2306 pid=2321 prio=120 target_cpu=017
    vhost-2931-2953    [013] D....   606.392390: sched_migrate_task: comm=vhost-2306 pid=2321 prio=120 orig_cpu=17 dest_cpu=23
    vhost-2931-2953    [013] d.h..   606.396964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000187 [ns] vruntime=2157888106 [ns] deadline=2160888106 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.406964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000112 [ns] vruntime=2167888218 [ns] deadline=2170888218 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.416964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999779 [ns] vruntime=2177887997 [ns] deadline=2180887997 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.426964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999667 [ns] vruntime=2187887664 [ns] deadline=2190887664 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.436964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000329 [ns] vruntime=2197887993 [ns] deadline=2200887993 [ns] lag=2102172740
    vhost-2931-2953    [013] D....   606.441980: sched_waking: comm=vhost-2306 pid=2325 prio=120 target_cpu=021
    vhost-2931-2953    [013] d.h..   606.446964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000069 [ns] vruntime=2207888062 [ns] deadline=2210888062 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.456964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999977 [ns] vruntime=2217888039 [ns] deadline=2220888039 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.466964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999548 [ns] vruntime=2227887587 [ns] deadline=2230887587 [ns] lag=2102172740
    vhost-2931-2953    [013] dNh..   606.466979: sched_wakeup: comm=migration/13 pid=80 prio=0 target_cpu=013
    vhost-2931-2953    [013] dN...   606.467017: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=41352 [ns] vruntime=2227928939 [ns] deadline=2230887587 [ns] lag=2102172740
    vhost-2931-2953    [013] d....   606.467018: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=migration/13 next_pid=80 next_prio=0
  migration/13-80      [013] d..1.   606.467020: sched_update: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230887587 sel=0 avg=0 min=2227928939 cpu=13 nr=1 lag=0 lim=10000000
  migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=12075393889 sed=12087868931 sel=0 avg=0 min=12075393889 cpu=13 nr=1 lag=0 lim=42139916
  migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=23017543001 sed=23036322254 sel=0 avg=0 min=23017543001 cpu=13 nr=1 lag=0 lim=63209874
  migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=30619368612 sed=30633124735 sel=0 avg=0 min=30619368612 cpu=13 nr=1 lag=0 lim=46126124
  migration/13-80      [013] d..1.   606.467022: sched_update: comm= pid=0 sev=57104166665 sed=57945071818 sel=-62439022 avg=161750065796 min=56117885776 cpu=13 nr=2 lag=-62439022 lim=62439022
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        57104166665   57945071818
                        kworker       56117885776   56120885776  --> last known values
--> cgroup's lag of -62439022 indicates that the scheduler knows that the cgroup ran for too long
--> nr=2 shows that the cgroup and the kworker are currently on the runqueue
  migration/13-80      [013] d..1.   606.467022: sched_migrate_task: comm=vhost-2931 pid=2953 prio=120 orig_cpu=13 dest_cpu=12
  migration/13-80      [013] d..1.   606.467023: sched_place: comm=vhost-2931 pid=2953 sev=2994881412 sed=2997881412 sel=0 avg=0 min=2994881412 cpu=12 nr=0 vru=2994881412 lag=0
  migration/13-80      [013] d..1.   606.467023: sched_place: comm= pid=0 sev=16617220304 sed=16632657489 sel=0 avg=0 min=16617220304 cpu=12 nr=0 vru=16617220304 lag=0
  migration/13-80      [013] d..1.   606.467024: sched_place: comm= pid=0 sev=30778525102 sed=30804781512 sel=0 avg=0 min=30778525102 cpu=12 nr=0 vru=30778525102 lag=0
  migration/13-80      [013] d..1.   606.467024: sched_place: comm= pid=0 sev=38704326194 sed=38724404624 sel=0 avg=0 min=38704326194 cpu=12 nr=0 vru=38704326194 lag=0
  migration/13-80      [013] d..1.   606.467025: sched_place: comm= pid=0 sev=66383057731 sed=66409091628 sel=-30739032 avg=0 min=66383057731 cpu=12 nr=0 vru=66383057731 lag=0
--> vhost migrated off to CPU 12
  migration/13-80      [013] d....   606.467026: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56117885776 d=56120885776 l=-725 md=56120885776 dp=0
  migration/13-80      [013] d....   606.467026: sched_stat_wait: comm=kworker/13:1 pid=168 delay=128714004 [ns]
  migration/13-80      [013] d....   606.467027: sched_switch: prev_comm=migration/13 prev_pid=80 prev_prio=0 prev_state=S ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
--> kworker runs next
  kworker/13:1-168     [013] d....   606.467030: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
  kworker/13:1-168     [013] d....   606.467032: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=6163 [ns] vruntime=56117891939 [ns] deadline=56120885776 [ns] lag=56120885776
  kworker/13:1-168     [013] d....   606.467032: sched_update: comm=kworker/13:1 pid=168 sev=56117891939 sed=56120885776 sel=0 avg=0 min=56117891939 cpu=13 nr=1 lag=0 lim=10000000
  kworker/13:1-168     [013] d....   606.467033: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=swapper/13 next_pid=0 next_prio=120
--> kworker finishes
        <idle>-0       [013] d.h..   606.479977: sched_place: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230928939 sel=0 avg=0 min=2227928939 cpu=13 nr=0 vru=2227928939 lag=0
--> vhost migrated back and placed on CPU 13 again
        <idle>-0       [013] d.h..   606.479977: sched_stat_sleep: comm=vhost-2931 pid=2953 delay=27874 [ns]
        <idle>-0       [013] d.h..   606.479977: sched_place: comm= pid=0 sev=12075393889 sed=12099393888 sel=0 avg=0 min=12075393889 cpu=13 nr=0 vru=12075393889 lag=0
        <idle>-0       [013] d.h..   606.479978: sched_place: comm= pid=0 sev=23017543001 sed=23056927616 sel=0 avg=0 min=23017543001 cpu=13 nr=0 vru=23017543001 lag=0
        <idle>-0       [013] d.h..   606.479978: sched_place: comm= pid=0 sev=30619368612 sed=30648907073 sel=0 avg=0 min=30619368612 cpu=13 nr=0 vru=30619368612 lag=0
        <idle>-0       [013] d.h..   606.479979: sched_place: comm= pid=0 sev=56117891939 sed=56168252594 sel=-166821875 avg=0 min=56117891939 cpu=13 nr=0 vru=56117891939 lag=0
        <idle>-0       [013] dNh..   606.479979: sched_wakeup: comm=vhost-2931 pid=2953 prio=120 target_cpu=013
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117891939 d=56168252594 l=-166821875 md=56168252594 dp=0
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        56117891939   56168252594
                        kworker       56117891939   56120885776
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=30619368612 d=30648907073 l=0 md=30648907073 dp=1
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=23017543001 d=23056927616 l=0 md=23056927616 dp=2
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=12075393889 d=12099393888 l=0 md=12099393888 dp=3
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2227928939 d=2230928939 l=0 md=2230928939 dp=4
        <idle>-0       [013] dN...   606.479982: sched_stat_wait: comm=vhost-2931 pid=2953 delay=0 [ns]
        <idle>-0       [013] d....   606.479982: sched_switch: prev_comm=swapper/13 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
--> vhost can continue to bully the kworker
    vhost-2931-2953    [013] d....   606.479995: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] d....   606.479996: sched_place: comm=kworker/13:1 pid=168 sev=56118220659 sed=56121220659 sel=0 avg=0 min=56118220659 cpu=13 nr=1 vru=56118220659 lag=0
    vhost-2931-2953    [013] d....   606.479996: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=12964004 [ns]
    vhost-2931-2953    [013] d....   606.479997: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] d....   606.479997: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=20837 [ns] vruntime=2227949776 [ns] deadline=2230928939 [ns] lag=2230928939
    vhost-2931-2953    [013] d....   606.479997: sched_update: comm=vhost-2931 pid=2953 sev=2227949776 sed=2230928939 sel=0 avg=0 min=2227949776 cpu=13 nr=1 lag=0 lim=10000000
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=12075560584 sed=12099393888 sel=0 avg=0 min=12075560584 cpu=13 nr=1 lag=0 lim=79999997
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=23017816553 sed=23056927616 sel=0 avg=0 min=23017816553 cpu=13 nr=1 lag=0 lim=131282050
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=30619573776 sed=30648907073 sel=0 avg=0 min=30619573776 cpu=13 nr=1 lag=0 lim=98461537
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=56118241726 sed=56168252594 sel=-19883 avg=0 min=56118220659 cpu=13 nr=2 lag=-19883 lim=167868850
    vhost-2931-2953    [013] d....   606.479999: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56118220659 d=56121220659 l=0 md=56121220659 dp=0
    vhost-2931-2953    [013] d....   606.479999: sched_stat_wait: comm=kworker/13:1 pid=168 delay=1255 [ns]
--> good delay of 1255 ns for the kworker
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        56118241726   56168252594
                        kworker       56118220659   56121220659

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]             ` <6564a012.c80a0220.adb78.f0e4SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2023-11-28  8:55               ` Abel Wu
  2023-11-29  6:31                 ` Tobias Huschle
                                   ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Abel Wu @ 2023-11-28  8:55 UTC (permalink / raw)
  To: Tobias Huschle, Peter Zijlstra
  Cc: Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
>>
>> The below should also work for internal entities, but last time I poked
>> around with it I had some regressions elsewhere -- you know, fix one,
>> wreck another type of situation.
>>
>> But still, could you please give it a go -- it applies cleanly to linus'
>> master and -rc2.
>>
>> ---
>> Subject: sched/eevdf: Revenge of the Sith^WSleepers
>>
> 
> Tried the patch, it does not help, unfortunately.
> 
> It might also be possible that the long-running vhost is stuck on something
> during those phases where it just runs for a while. This might have been a
> risk all along; EEVDF might have just uncovered an unfortunate sequence of
> events.
> I'll look into this option.
> 
> I also added some more trace outputs in order to find the actual vruntimes
> of the cgroup parents. The numbers look kind of reasonable, but I struggle
> to judge this with certainty.
> 
> In one of the scenarios where the kworker sees an absurd wait time, the
> following occurs (full trace below):
> 
> - The kworker ends its timeslice after 4941 ns
> - __pick_eevdf finds the cgroup holding vhost as the next option to execute
> - Last known values are:
>                      vruntime      deadline
>     cgroup        56117619190   57650477291 -> depth 0
>     kworker       56117624131   56120619190
>    This is fair, since the kworker is not runnable here.
> - At depth 4, the cgroup shows the observed vruntime value which is smaller
>    by a factor of 20, but depth 0 seems to be running with values of the
>    correct magnitude.

A child being the running entity implies that its parent is also the
cfs_rq's curr, but not vice versa if there is more than one child.

> - cgroup at depth 0 has zero lag, with higher depth, there are large lag
>    values (as observed 606.338267 onwards)

These values of se->vlag mean 'run this entity to parity' to avoid
excess context switches, which is what RUN_TO_PARITY does, or nothing
when !RUN_TO_PARITY. In short, se->vlag is not the vlag when se->on_rq.
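
(IIRC the stash happens in set_next_entity(): when an on_rq entity is picked,
its deadline is copied into se->vlag, and pick_eevdf() keeps returning curr as
long as that copy is intact and RUN_TO_PARITY is enabled.)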

> 
> Now the following occurs, triggered by the vhost:
> - The kworker gets placed again with:
>                      vruntime      deadline
>     cgroup        56117619190   57650477291 -> depth 0, last known value
>     kworker       56117885776   56120885776 -> lag of -725
> - vhost continues executing and updates its vruntime accordingly, here
>    I would need to enhance the trace to also print the vruntimes of the
>    parent sched_entities to see the progress of their vruntime/deadline/lag
>    values as well
> - It is a bit irritating that the EEVDF algorithm would not pick the kworker
>    over the cgroup as its deadline is smaller.
>    But, the kworker has negative lag, which might cause EEVDF to not pick
>    the kworker.
>    The cgroup at depth 0 has no lag; all deeper layers have a significantly
>    positive lag (last known values, might have changed in the meantime).
>    At this point I would see the option that the vhost task is stuck
>    somewhere or EEVDF just does not see the kworker as an eligible option.

IMHO such lag should not introduce that long delay. Can you run the
test again with NEXT_BUDDY disabled?
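
(For reference: it can be toggled at runtime by writing NO_NEXT_BUDDY to
/sys/kernel/debug/sched/features, assuming debugfs is mounted at the usual
location.)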

> 
> - Once the vhost is migrated off the cpu, the update_entity_lag function
>    works with the following values at 606.467022: sched_update
>    For the cgroup at depth 0
>    - vruntime = 57104166665 --> this is in line with the amount of new timeslices
>                                 vhost got assigned while the kworker was waiting
>    - vlag     =   -62439022 --> the scheduler knows that the cgroup was
>                                 overconsuming, but no runtime for the kworker
>    For the cfs_rq we have
>    - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
>    - avg_vruntime = 161750065796 --> this is rather large in comparison, but I
>                                      might access this value at a bad time

Use avg_vruntime() instead.
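
The raw cfs_rq->avg_vruntime field is not directly comparable to a vruntime.
Roughly speaking (toy user-space model below, simplified: the kernel uses
scaled-down weights and handles the current entity and rounding separately),
it is a load-weighted sum of offsets against min_vruntime, and avg_vruntime()
converts it back into an absolute value:

#include <stdio.h>

/*
 * Toy model only, not the kernel code: the field accumulates
 * weight * (vruntime - min_vruntime) for the queued entities, so a
 * "rather large" value is expected; dividing by the summed weight and
 * adding min_vruntime back gives the average that is worth comparing
 * against se->vruntime.
 */
int main(void)
{
	long long min_vruntime = 100;
	/* two hypothetical entities: vruntime 100 (weight 2) and 130 (weight 1) */
	long long raw_avg  = 2 * (100 - min_vruntime) + 1 * (130 - min_vruntime);
	long long avg_load = 2 + 1;

	printf("raw field      = %lld\n", raw_avg);				/* 30  */
	printf("avg_vruntime() = %lld\n", min_vruntime + raw_avg / avg_load);	/* 110 */
	return 0;
}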

>    - nr_running   =            2 --> at this point, both, cgroup and kworker are
>                                      still on the queue, with the cgroup being
>                                      in the migration process
> --> It seems like the overconsumption accumulates at cgroup depth 0 and is not
>      propagated downwards. This might be intended though.
> 
> - At 606.479979: sched_place, the cgroup hosting the vhost is migrated back
>    onto cpu 13 with a lag of -166821875; it gets scheduled right away as
>    there is no other task (nr_running = 0)
> 
> - At 606.479996: sched_place, the kworker gets placed again, this time
>    with no lag, and gets scheduled almost immediately, with a wait
>    time of 1255 ns.
> 
> It shall be noted, that these scenarios also occur when the first placement
> of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
> when observing this issue.
> 
> ######################### full trace #########################
> 
> sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
> --> during __pick_eevdf, prints values for best and the first node loop variable, second loop is never executed
> 
> sched_place/sched_update: sev=se->vruntime,sed=se->deadline,sel=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime

It would be better to replace cfs_rq->avg_vruntime with avg_vruntime().
Although we can get the real @avg by (vruntime + vlag), I am not sure
whether vlag (@lag in the trace) is se->vlag or the local variable in the
place function, which is scaled and no longer the true vlag.

> --> at the end of place_entity and update_entity_lag
> 
> --> the chunks of 5 entries for these new events represent the 5 levels of the cgroup which hosts the vhost
> 
>      vhost-2931-2953    [013] d....   606.338262: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=90133345 [ns]
>      vhost-2931-2953    [013] d....   606.338262: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
>      vhost-2931-2953    [013] dN...   606.338263: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
>      vhost-2931-2953    [013] dN...   606.338263: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
>      vhost-2931-2953    [013] dN...   606.338263: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=17910 [ns] vruntime=2099190650 [ns] deadline=2102172740 [ns] lag=2102172740
>      vhost-2931-2953    [013] dN...   606.338264: sched_stat_wait: comm=kworker/13:1 pid=168 delay=0 [ns]
>      vhost-2931-2953    [013] d....   606.338264: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
> --> kworker allowed to execute
>    kworker/13:1-168     [013] d....   606.338266: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
>    kworker/13:1-168     [013] d....   606.338267: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=4941 [ns] vruntime=56117624131 [ns] deadline=56120619190 [ns] lag=56120619190
> --> runtime of 4941 ns
>    kworker/13:1-168     [013] d....   606.338267: sched_update: comm=kworker/13:1 pid=168 sev=56117624131 sed=56120619190 sel=-725 avg=0 min=56117619190 cpu=13 nr=2 lag=-725 lim=10000000
>    kworker/13:1-168     [013] d....   606.338267: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117619190 d=57650477291 l=0 md=57650477291 dp=0
> --> depth 0 of cgroup holding vhost:     vruntime      deadline
>                          cgroup        56117619190   57650477291
>                          kworker       56117624131   56120619190
>    kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=29822481776 d=29834647752 l=29834647752 md=29834647752 dp=1
>    kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=21909608438 d=21919458955 l=21919458955 md=21919458955 dp=2
>    kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=11306038504 d=11312426915 l=11312426915 md=11312426915 dp=3
>    kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2099190650 d=2102172740 l=2102172740 md=2102172740 dp=4
>    kworker/13:1-168     [013] d....   606.338268: sched_stat_wait: comm=vhost-2931 pid=2953 delay=4941 [ns]
>    kworker/13:1-168     [013] d....   606.338269: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
>      vhost-2931-2953    [013] d....   606.338311: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
>      vhost-2931-2953    [013] d....   606.338312: sched_place: comm=kworker/13:1 pid=168 sev=56117885776 sed=56120885776 sel=-725 avg=0 min=56117880833 cpu=13 nr=1 vru=56117880833 lag=-725
> --> kworker gets placed again
>      vhost-2931-2953    [013] d....   606.338312: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=44970 [ns]
>      vhost-2931-2953    [013] d....   606.338313: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> --> kworker set to runnable, but vhost keeps on executing

What are the weights of the two entities?

>      vhost-2931-2953    [013] d.h..   606.346964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=8697702 [ns] vruntime=2107888352 [ns] deadline=2110888352 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.356964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999583 [ns] vruntime=2117887935 [ns] deadline=2120887935 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.366964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000089 [ns] vruntime=2127888024 [ns] deadline=2130888024 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.376964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999716 [ns] vruntime=2137887740 [ns] deadline=2140887740 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.386964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000179 [ns] vruntime=2147887919 [ns] deadline=2150887919 [ns] lag=2102172740
>      vhost-2931-2953    [013] D....   606.392250: sched_waking: comm=vhost-2306 pid=2324 prio=120 target_cpu=018
>      vhost-2931-2953    [013] D....   606.392388: sched_waking: comm=vhost-2306 pid=2321 prio=120 target_cpu=017
>      vhost-2931-2953    [013] D....   606.392390: sched_migrate_task: comm=vhost-2306 pid=2321 prio=120 orig_cpu=17 dest_cpu=23
>      vhost-2931-2953    [013] d.h..   606.396964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000187 [ns] vruntime=2157888106 [ns] deadline=2160888106 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.406964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000112 [ns] vruntime=2167888218 [ns] deadline=2170888218 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.416964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999779 [ns] vruntime=2177887997 [ns] deadline=2180887997 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.426964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999667 [ns] vruntime=2187887664 [ns] deadline=2190887664 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.436964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000329 [ns] vruntime=2197887993 [ns] deadline=2200887993 [ns] lag=2102172740
>      vhost-2931-2953    [013] D....   606.441980: sched_waking: comm=vhost-2306 pid=2325 prio=120 target_cpu=021
>      vhost-2931-2953    [013] d.h..   606.446964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000069 [ns] vruntime=2207888062 [ns] deadline=2210888062 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.456964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999977 [ns] vruntime=2217888039 [ns] deadline=2220888039 [ns] lag=2102172740
>      vhost-2931-2953    [013] d.h..   606.466964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999548 [ns] vruntime=2227887587 [ns] deadline=2230887587 [ns] lag=2102172740
>      vhost-2931-2953    [013] dNh..   606.466979: sched_wakeup: comm=migration/13 pid=80 prio=0 target_cpu=013
>      vhost-2931-2953    [013] dN...   606.467017: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=41352 [ns] vruntime=2227928939 [ns] deadline=2230887587 [ns] lag=2102172740
>      vhost-2931-2953    [013] d....   606.467018: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=migration/13 next_pid=80 next_prio=0
>    migration/13-80      [013] d..1.   606.467020: sched_update: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230887587 sel=0 avg=0 min=2227928939 cpu=13 nr=1 lag=0 lim=10000000
>    migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=12075393889 sed=12087868931 sel=0 avg=0 min=12075393889 cpu=13 nr=1 lag=0 lim=42139916
>    migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=23017543001 sed=23036322254 sel=0 avg=0 min=23017543001 cpu=13 nr=1 lag=0 lim=63209874
>    migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=30619368612 sed=30633124735 sel=0 avg=0 min=30619368612 cpu=13 nr=1 lag=0 lim=46126124
>    migration/13-80      [013] d..1.   606.467022: sched_update: comm= pid=0 sev=57104166665 sed=57945071818 sel=-62439022 avg=161750065796 min=56117885776 cpu=13 nr=2 lag=-62439022 lim=62439022
> --> depth 0 of cgroup holding vhost:     vruntime      deadline
>                          cgroup        57104166665   57945071818
>                          kworker       56117885776   56120885776  --> last known values
> --> cgroup's lag of -62439022 indicates that the scheduler knows that the cgroup ran for too long
> --> nr=2 shows that the cgroup and the kworker are currently on the runqueue
>    migration/13-80      [013] d..1.   606.467022: sched_migrate_task: comm=vhost-2931 pid=2953 prio=120 orig_cpu=13 dest_cpu=12
>    migration/13-80      [013] d..1.   606.467023: sched_place: comm=vhost-2931 pid=2953 sev=2994881412 sed=2997881412 sel=0 avg=0 min=2994881412 cpu=12 nr=0 vru=2994881412 lag=0
>    migration/13-80      [013] d..1.   606.467023: sched_place: comm= pid=0 sev=16617220304 sed=16632657489 sel=0 avg=0 min=16617220304 cpu=12 nr=0 vru=16617220304 lag=0
>    migration/13-80      [013] d..1.   606.467024: sched_place: comm= pid=0 sev=30778525102 sed=30804781512 sel=0 avg=0 min=30778525102 cpu=12 nr=0 vru=30778525102 lag=0
>    migration/13-80      [013] d..1.   606.467024: sched_place: comm= pid=0 sev=38704326194 sed=38724404624 sel=0 avg=0 min=38704326194 cpu=12 nr=0 vru=38704326194 lag=0
>    migration/13-80      [013] d..1.   606.467025: sched_place: comm= pid=0 sev=66383057731 sed=66409091628 sel=-30739032 avg=0 min=66383057731 cpu=12 nr=0 vru=66383057731 lag=0
> --> vhost migrated off to CPU 12
>    migration/13-80      [013] d....   606.467026: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56117885776 d=56120885776 l=-725 md=56120885776 dp=0
>    migration/13-80      [013] d....   606.467026: sched_stat_wait: comm=kworker/13:1 pid=168 delay=128714004 [ns]
>    migration/13-80      [013] d....   606.467027: sched_switch: prev_comm=migration/13 prev_pid=80 prev_prio=0 prev_state=S ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
> --> kworker runs next
>    kworker/13:1-168     [013] d....   606.467030: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
>    kworker/13:1-168     [013] d....   606.467032: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=6163 [ns] vruntime=56117891939 [ns] deadline=56120885776 [ns] lag=56120885776
>    kworker/13:1-168     [013] d....   606.467032: sched_update: comm=kworker/13:1 pid=168 sev=56117891939 sed=56120885776 sel=0 avg=0 min=56117891939 cpu=13 nr=1 lag=0 lim=10000000
>    kworker/13:1-168     [013] d....   606.467033: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=swapper/13 next_pid=0 next_prio=120
> --> kworker finishes
>          <idle>-0       [013] d.h..   606.479977: sched_place: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230928939 sel=0 avg=0 min=2227928939 cpu=13 nr=0 vru=2227928939 lag=0
> --> vhost migrated back and placed on CPU 13 again
>          <idle>-0       [013] d.h..   606.479977: sched_stat_sleep: comm=vhost-2931 pid=2953 delay=27874 [ns]
>          <idle>-0       [013] d.h..   606.479977: sched_place: comm= pid=0 sev=12075393889 sed=12099393888 sel=0 avg=0 min=12075393889 cpu=13 nr=0 vru=12075393889 lag=0
>          <idle>-0       [013] d.h..   606.479978: sched_place: comm= pid=0 sev=23017543001 sed=23056927616 sel=0 avg=0 min=23017543001 cpu=13 nr=0 vru=23017543001 lag=0
>          <idle>-0       [013] d.h..   606.479978: sched_place: comm= pid=0 sev=30619368612 sed=30648907073 sel=0 avg=0 min=30619368612 cpu=13 nr=0 vru=30619368612 lag=0
>          <idle>-0       [013] d.h..   606.479979: sched_place: comm= pid=0 sev=56117891939 sed=56168252594 sel=-166821875 avg=0 min=56117891939 cpu=13 nr=0 vru=56117891939 lag=0
>          <idle>-0       [013] dNh..   606.479979: sched_wakeup: comm=vhost-2931 pid=2953 prio=120 target_cpu=013
>          <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117891939 d=56168252594 l=-166821875 md=56168252594 dp=0
> --> depth 0 of cgroup holding vhost:     vruntime      deadline
>                          cgroup        56117891939   56168252594
>                          kworker       56117891939   56120885776
>          <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=30619368612 d=30648907073 l=0 md=30648907073 dp=1
>          <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=23017543001 d=23056927616 l=0 md=23056927616 dp=2
>          <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=12075393889 d=12099393888 l=0 md=12099393888 dp=3
>          <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2227928939 d=2230928939 l=0 md=2230928939 dp=4
>          <idle>-0       [013] dN...   606.479982: sched_stat_wait: comm=vhost-2931 pid=2953 delay=0 [ns]
>          <idle>-0       [013] d....   606.479982: sched_switch: prev_comm=swapper/13 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
> --> vhost can continue to bully the kworker
>      vhost-2931-2953    [013] d....   606.479995: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
>      vhost-2931-2953    [013] d....   606.479996: sched_place: comm=kworker/13:1 pid=168 sev=56118220659 sed=56121220659 sel=0 avg=0 min=56118220659 cpu=13 nr=1 vru=56118220659 lag=0
>      vhost-2931-2953    [013] d....   606.479996: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=12964004 [ns]
>      vhost-2931-2953    [013] d....   606.479997: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
>      vhost-2931-2953    [013] d....   606.479997: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=20837 [ns] vruntime=2227949776 [ns] deadline=2230928939 [ns] lag=2230928939
>      vhost-2931-2953    [013] d....   606.479997: sched_update: comm=vhost-2931 pid=2953 sev=2227949776 sed=2230928939 sel=0 avg=0 min=2227949776 cpu=13 nr=1 lag=0 lim=10000000
>      vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=12075560584 sed=12099393888 sel=0 avg=0 min=12075560584 cpu=13 nr=1 lag=0 lim=79999997
>      vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=23017816553 sed=23056927616 sel=0 avg=0 min=23017816553 cpu=13 nr=1 lag=0 lim=131282050
>      vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=30619573776 sed=30648907073 sel=0 avg=0 min=30619573776 cpu=13 nr=1 lag=0 lim=98461537
>      vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=56118241726 sed=56168252594 sel=-19883 avg=0 min=56118220659 cpu=13 nr=2 lag=-19883 lim=167868850
>      vhost-2931-2953    [013] d....   606.479999: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56118220659 d=56121220659 l=0 md=56121220659 dp=0
>      vhost-2931-2953    [013] d....   606.479999: sched_stat_wait: comm=kworker/13:1 pid=168 delay=1255 [ns]
> --> good delay of 1255 ns for the kworker
> --> depth 0 of cgroup holding vhost:     vruntime      deadline
>                          cgroup        56118241726   56168252594
>                          kworker       56118220659   56121220659

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-28  8:55               ` Abel Wu
@ 2023-11-29  6:31                 ` Tobias Huschle
  2023-12-07  6:22                 ` Tobias Huschle
       [not found]                 ` <07513.123120701265800278@us-mta-474.us.mimecast.lan>
  2 siblings, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-11-29  6:31 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

[...]

> > - At depth 4, the cgroup shows the observed vruntime value which is smaller
> >    by a factor of 20, but depth 0 seems to be running with values of the
> >    correct magnitude.
> 
> A child being the running entity implies that its parent is also the
> cfs_rq's curr, but not vice versa if there is more than one child.
> 
> > - cgroup at depth 0 has zero lag, with higher depth, there are large lag
> >    values (as observed 606.338267 onwards)
> 
> These values of se->vlag mean 'run this entity to parity' to avoid
> excess context switches, which is what RUN_TO_PARITY does, or nothing
> when !RUN_TO_PARITY. In short, se->vlag is not the vlag when se->on_rq.
> 

Thanks for clarifying that. This makes things clearer to me.

> > 
> > Now the following occurs, triggered by the vhost:
> > - The kworker gets placed again with:
> >                      vruntime      deadline
> >     cgroup        56117619190   57650477291 -> depth 0, last known value
> >     kworker       56117885776   56120885776 -> lag of -725
> > - vhost continues executing and updates its vruntime accordingly, here
> >    I would need to enhance the trace to also print the vruntimes of the
> >    parent sched_entities to see the progress of their vruntime/deadline/lag
> >    values as well
> > - It is a bit irritating that the EEVDF algorithm would not pick the kworker
> >    over the cgroup as its deadline is smaller.
> >    But, the kworker has negative lag, which might cause EEVDF to not pick
> >    the kworker.
> >    The cgroup at depth 0 has no lag; all deeper layers have a significantly
> >    positive lag (last known values, might have changed in the meantime).
> >    At this point I would see the option that the vhost task is stuck
> >    somewhere or EEVDF just does not see the kworker as an eligible option.
> 
> IMHO such lag should not introduce that long delay. Can you run the
> test again with NEXT_BUDDY disabled?

I added a trace event to the next buddy path; it does not get triggered, so I'd
assume that no buddies are selected.

> 
> > 
> > - Once the vhost is migrated off the cpu, the update_entity_lag function
> >    works with the following values at 606.467022: sched_update
> >    For the cgroup at depth 0
> >    - vruntime = 57104166665 --> this is in line with the amount of new timeslices
> >                                 vhost got assigned while the kworker was waiting
> >    - vlag     =   -62439022 --> the scheduler knows that the cgroup was
> >                                 overconsuming, but no runtime for the kworker
> >    For the cfs_rq we have
> >    - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
> >    - avg_vruntime = 161750065796 --> this is rather large in comparison, but I
> >                                      might access this value at a bad time
> 
> Use avg_vruntime() instead.

Fair.

[...]

> > 
> > ######################### full trace #########################
> > 
> > sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
> > --> during __pick_eevdf, prints values for best and the first node loop variable, second loop is never executed
> > 
> > sched_place/sched_update: sev=se->vruntime,sed=se->deadline,sel=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
> 
> It would be better to replace cfs_rq->avg_vruntime with avg_vruntime().
> Although we can get the real @avg by (vruntime + vlag), I am not sure
> whether vlag (@lag in the trace) is se->vlag or the local variable in the
> place function, which is scaled and no longer the true vlag.
> 

Oh my bad, sel is the vlag value of the sched_entity; lag is the local variable.

[...]

> >      vhost-2931-2953    [013] d....   606.338313: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> > --> kworker set to runnable, but vhost keeps on executing
> 
> What are the weights of the two entities?

I'll do another run and look at those values.

[...]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-11-28  8:55               ` Abel Wu
  2023-11-29  6:31                 ` Tobias Huschle
@ 2023-12-07  6:22                 ` Tobias Huschle
       [not found]                 ` <07513.123120701265800278@us-mta-474.us.mimecast.lan>
  2 siblings, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-12-07  6:22 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Linux Kernel, kvm, virtualization, netdev, mst, jasowang

On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
[...]
> 
> What are the weights of the two entities?
> 

Both entities have the same weights (I saw 1048576 for both of them).
The story looks different when we look at the cgroup hierarchy though:

sew := weight of the sched entity (se->load.weight)

     CPU 6/KVM-2360    [011] d....  1158.884473: sched_place: comm=vhost-2961 pid=2984 sev=3595548386 sed=3598548386 sel=0 sew=1048576 avg=3595548386 min=3595548386 cpu=11 nr=0 vru=3595548386 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884473: sched_place: comm= pid=0 sev=19998138425 sed=20007532920 sel=0 sew=335754 avg=19998138425 min=19998138425 cpu=11 nr=0 vru=19998138425 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884474: sched_place: comm= pid=0 sev=37794158943 sed=37807515464 sel=0 sew=236146 avg=37794158943 min=37794158943 cpu=11 nr=0 vru=37794158943 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884474: sched_place: comm= pid=0 sev=50387168150 sed=50394482435 sel=0 sew=430665 avg=50387168150 min=50387168150 cpu=11 nr=0 vru=50387168150 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884474: sched_place: comm= pid=0 sev=76600751247 sed=77624751246 sel=0 sew=3876 avg=76600751247 min=76600751247 cpu=11 nr=0 vru=76600751247 lag=0
<...>
    vhost-2961-2984    [011] d....  1158.884487: sched_place: comm=kworker/11:2 pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0

Here we can see the following weights:
kworker     -> 1048576
vhost       -> 1048576
cgroup root ->    3876

kworker and vhost weights remain the same. The weights of the nodes in the cgroup vary.
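
To put those weights into perspective (simplified arithmetic below, not the
kernel's calc_delta_fair() with its fixed-point inverse weights; also the
group weights are recomputed continuously, so the exact ratios in the trace
differ): vruntime advances roughly as delta_exec * NICE_0_LOAD / weight, so a
low-weight group entity ages much faster in virtual time than the nice-0
tasks below it.

#include <stdio.h>

#define NICE_0_LOAD 1048576LL	/* matches the sew of the nice-0 tasks above */

/* simplified proportionality; the kernel computes this via calc_delta_fair() */
static long long vruntime_delta(long long delta_exec_ns, long long weight)
{
	return delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
	long long slice = 10000000;	/* 10 ms of real runtime */

	printf("weight 1048576 -> vruntime advances by %lld ns\n",
	       vruntime_delta(slice, 1048576));
	printf("weight    3876 -> vruntime advances by %lld ns\n",
	       vruntime_delta(slice, 3876));
	return 0;
}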


I also gave this some more thought and have some more observations:

1. kworker lag after short runtime

    vhost-2961-2984    [011] d....  1158.884486: sched_waking: comm=kworker/11:2 pid=202 prio=120 target_cpu=011
    vhost-2961-2984    [011] d....  1158.884487: sched_place: comm=kworker/11:2 pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0
<...>                                                                                                                   ^^^^^
    vhost-2961-2984    [011] d....  1158.884490: sched_switch: prev_comm=vhost-2961 prev_pid=2984 prev_prio=120 prev_state=R+ ==> next_comm=kworker/11:2 next_pid=202 next_prio=120
   kworker/11:2-202    [011] d....  1158.884491: sched_waking: comm=CPU 0/KVM pid=2988 prio=120 target_cpu=009
   kworker/11:2-202    [011] d....  1158.884492: sched_stat_runtime: comm=kworker/11:2 pid=202 runtime=5150 [ns] vruntime=76603911111 [ns] deadline=76606905961 [ns] lag=76606905961
                                                                                               ^^^^^^^^^^^^^^^^
   kworker/11:2-202    [011] d....  1158.884492: sched_update: comm=kworker/11:2 pid=202 sev=76603911111 sed=76606905961 sel=-1128 sew=1048576 avg=76603909983 min=76603905961 cpu=11 nr=2 lag=-1128 lim=10000000
                                                                                                                         ^^^^^^^^^
   kworker/11:2-202    [011] d....  1158.884494: sched_stat_wait: comm=vhost-2961 pid=2984 delay=5150 [ns]
   kworker/11:2-202    [011] d....  1158.884494: sched_switch: prev_comm=kworker/11:2 prev_pid=202 prev_prio=120 prev_state=I ==> next_comm=vhost-2961 next_pid=2984 next_prio=120

In the sequence above, the kworker gets woken up by the vhost and placed on the
timeline with 0 lag.
The kworker then executes for 5150 ns and returns control to the vhost.
Unfortunately, this short runtime earns the kworker a negative lag of -1128.
This, in turn, causes the kworker to not be selected by check_preempt_wakeup_fair.

My naive understanding of lag is that only those entities which consume more
time than they should get negative lag. Why is the kworker being punished for
running only a tiny portion of time?

In the majority of cases, the kworker finishes after a 4-digit number of ns.
There are occasional outliers with 5-digit numbers. I would therefore not
expect negative lag for the kworker.

It is fair to say that the kworker was executing while the vhost was not.
The kworker gets put on the queue with no lag, so it essentially has its
vruntime set to avg_vruntime.
After giving up its timeslice, the kworker now has a vruntime which is larger
than the avg_vruntime. Hence, the negative lag might make sense here from an
algorithmic standpoint.
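
The traced numbers at 1158.884492 are consistent with that reading: as far as
I can see, update_entity_lag() simply computes avg_vruntime() minus the
entity's vruntime (and then clamps it), which for the values above gives
exactly the -1128:

#include <stdio.h>

int main(void)
{
	/* values from the sched_update event at 1158.884492 */
	long long avg_vruntime     = 76603909983LL;	/* avg= */
	long long kworker_vruntime = 76603911111LL;	/* sev= */

	/* lag = avg_vruntime - vruntime (clamping against lim ignored here) */
	printf("lag = %lld\n", avg_vruntime - kworker_vruntime);	/* -1128 */
	return 0;
}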


2a/b. vhost getting increased deadlines over time, no call of pick_eevdf

    vhost-2961-2984    [011] d.h..  1158.892878: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=8385872 [ns] vruntime=3603948448 [ns] deadline=3606948448 [ns] lag=3598548386
    vhost-2961-2984    [011] d.h..  1158.892879: sched_stat_runtime: comm= pid=0 runtime=8385872 [ns] vruntime=76604158567 [ns] deadline=77624751246 [ns] lag=77624751246
<..>
    vhost-2961-2984    [011] d.h..  1158.902877: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=9999435 [ns] vruntime=3613947883 [ns] deadline=3616947883 [ns] lag=3598548386
    vhost-2961-2984    [011] d.h..  1158.902878: sched_stat_runtime: comm= pid=0 runtime=9999435 [ns] vruntime=76633826282 [ns] deadline=78137144356 [ns] lag=77624751246
<..>
    vhost-2961-2984    [011] d.h..  1158.912877: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=9999824 [ns] vruntime=3623947707 [ns] deadline=3626947707 [ns] lag=3598548386
    vhost-2961-2984    [011] d.h..  1158.912878: sched_stat_runtime: comm= pid=0 runtime=9999824 [ns] vruntime=76688003113 [ns] deadline=78161723086 [ns] lag=77624751246
<..>
<..>
    vhost-2961-2984    [011] dN...  1159.152927: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=40402 [ns] vruntime=3863988069 [ns] deadline=3866947667 [ns] lag=3598548386
    vhost-2961-2984    [011] dN...  1159.152928: sched_stat_runtime: comm= pid=0 runtime=40402 [ns] vruntime=78355923791 [ns] deadline=78393801472 [ns] lag=77624751246

In the sequence above, I extended the tracing of sched_stat_runtime to use 
for_each_sched_entity to also output the values for the cgroup hierarchy.
The first entry represents the actual task, the second entry represents
the root for that particular cgroup. I dropped the levels in between
for readability.

The first three groupings are happening in sequence. The fourth grouping
is the last sched_stat_runtime update before the vhost gets migrated off
the CPU. The ones in between repeat the same pattern.

Interestingly, the vruntimes of the root grow faster than those of the actual
tasks. I assume this is intended.
At the same time, the deadlines keep on growing for vhost and the cgroup root,
while the kworker is left starving with its negative lag.
At no point in this sequence is pick_eevdf being called.
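
(Side note on the growing deadlines, as far as I understand update_deadline():
the deadline is simply kept one slice ahead of the vruntime, so an entity that
keeps running keeps pushing it out. A toy model of that pattern, using the
~3 ms gap visible in the trace as the slice:)

#include <stdio.h>

int main(void)
{
	long long vruntime = 3603948448LL;	/* traced values at 1158.892878 */
	long long deadline = 3606948448LL;
	long long slice    = 3000000;		/* ~3 ms, as seen in the trace */
	int tick;

	for (tick = 0; tick < 3; tick++) {
		vruntime += 10000000;			/* ~10 ms of runtime per tick */
		if (vruntime >= deadline)		/* request used up ... */
			deadline = vruntime + slice;	/* ... so grant a new slice */
		printf("vruntime=%lld deadline=%lld\n", vruntime, deadline);
	}
	return 0;
}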

The only time pick_eevdf is being called is right when the kworker is woken up.
So check_preempt_wakeup_fair seems to be the only chance for the kworker to get
scheduled in time.

For reference:
    vhost-2961-2984    [011] d....  1158.884563: sched_place: comm=kworker/11:2 pid=202 sev=76604163719 sed=76607163719 sel=-1128 sew=1048576 avg=76604158567 min=76604158567 cpu=11 nr=1 vru=76604158567 lag=-5152

The kworker has a deadline which is definitely smaller than the one of vhost
in later stages. So, I would assume it should get scheduled at some point.
If vhost is running in kernel space and is therefore not preemptable,
this would be expected behavior though.


3. vhost looping endlessly, waiting for kworker to be scheduled

I dug a little deeper on what the vhost is doing. I'm not an expert on
virtio whatsoever, so these are just educated guesses that maybe
someone can verify/correct. Please bear with me probably messing up 
the terminology.

- vhost is looping through available queues.
- vhost wants to wake up a kworker to process a found queue.
- kworker does something with that queue and terminates quickly.

What I found by throwing in some very noisy trace statements was that,
if the kworker is not woken up, the vhost just keeps looping across
all available queues (and seems to repeat itself). So it essentially
relies on the scheduler to schedule the kworker fast enough. Otherwise
it will just keep on looping until it is migrated off the CPU.
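
As rough pseudocode, my mental model of that behaviour looks like this
(the helpers below are made up; this is not the actual vhost code):

	while (!need_resched()) {		/* nothing sets the flag for us */
		for_each_queue(vq) {		/* made-up helper names         */
			if (queue_has_work(vq))
				wake_up_kworker(vq);	/* woken, but never runs */
		}
		/* ...so the next pass finds the same queues with work again */
	}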


SUMMARY 

1 and 2a/b have some more or less plausible potential explanations,
where the EEVDF scheduler might just be doing what it is designed to do.

3 is trickier since I'm not familiar with the topic. If the vhost just
relies on the kworker preempting the vhost, then this sounds a bit
counter-intuitive. But there might also be a valid design decision
behind this.

If 1 and 2 are indeed plausible, path 3 is probably the
one to pursue in order to figure out whether we have a problem there.
[...]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                 ` <07513.123120701265800278@us-mta-474.us.mimecast.lan>
@ 2023-12-07  6:48                   ` Michael S. Tsirkin
  2023-12-08  9:24                     ` Tobias Huschle
       [not found]                     ` <56082.123120804242300177@us-mta-137.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-07  6:48 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Abel Wu, Peter Zijlstra, Linux Kernel, kvm, virtualization,
	netdev, jasowang

On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> 3. vhost looping endlessly, waiting for kworker to be scheduled
> 
> I dug a little deeper on what the vhost is doing. I'm not an expert on
> virtio whatsoever, so these are just educated guesses that maybe
> someone can verify/correct. Please bear with me probably messing up 
> the terminology.
> 
> - vhost is looping through available queues.
> - vhost wants to wake up a kworker to process a found queue.
> - kworker does something with that queue and terminates quickly.
> 
> What I found by throwing in some very noisy trace statements was that,
> if the kworker is not woken up, the vhost just keeps looping accross
> all available queues (and seems to repeat itself). So it essentially
> relies on the scheduler to schedule the kworker fast enough. Otherwise
> it will just keep on looping until it is migrated off the CPU.


Normally it takes the buffers off the queue and is done with it.
I am guessing that at the same time guest is running on some other
CPU and keeps adding available buffers?


-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-07  6:48                   ` Michael S. Tsirkin
@ 2023-12-08  9:24                     ` Tobias Huschle
  2023-12-08 17:28                       ` Mike Christie
       [not found]                     ` <56082.123120804242300177@us-mta-137.us.mimecast.lan>
  1 sibling, 1 reply; 58+ messages in thread
From: Tobias Huschle @ 2023-12-08  9:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Abel Wu, Peter Zijlstra, Linux Kernel, kvm, virtualization,
	netdev, jasowang

On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > 
> > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > virtio whatsoever, so these are just educated guesses that maybe
> > someone can verify/correct. Please bear with me probably messing up 
> > the terminology.
> > 
> > - vhost is looping through available queues.
> > - vhost wants to wake up a kworker to process a found queue.
> > - kworker does something with that queue and terminates quickly.
> > 
> > What I found by throwing in some very noisy trace statements was that,
> > if the kworker is not woken up, the vhost just keeps looping accross
> > all available queues (and seems to repeat itself). So it essentially
> > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > it will just keep on looping until it is migrated off the CPU.
> 
> 
> Normally it takes the buffers off the queue and is done with it.
> I am guessing that at the same time guest is running on some other
> CPU and keeps adding available buffers?
> 

It seems to do just that; there are multiple other vhost instances
involved which might keep filling up those queues.

Unfortunately, this makes the problematic vhost instance stay on
the CPU and prevents said kworker from getting scheduled. The kworker is
explicitly woken up by vhost, so vhost wants it to do something.

At this point it seems that there is an assumption about the scheduler
in place which is no longer fulfilled by EEVDF. From the discussion so
far, it seems like EEVDF does what it is intended to do.

Shouldn't there be a more explicit mechanism in use that allows the
kworker to be scheduled in favor of the vhost?

It is also concerning that the vhost seemingly cannot be preempted by the
scheduler while executing that loop.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                     ` <56082.123120804242300177@us-mta-137.us.mimecast.lan>
@ 2023-12-08 10:31                       ` Michael S. Tsirkin
  2023-12-08 11:41                         ` Tobias Huschle
       [not found]                         ` <53044.123120806415900549@us-mta-342.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-08 10:31 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Abel Wu, Peter Zijlstra, Linux Kernel, kvm, virtualization,
	netdev, jasowang

On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > 
> > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > virtio whatsoever, so these are just educated guesses that maybe
> > > someone can verify/correct. Please bear with me probably messing up 
> > > the terminology.
> > > 
> > > - vhost is looping through available queues.
> > > - vhost wants to wake up a kworker to process a found queue.
> > > - kworker does something with that queue and terminates quickly.
> > > 
> > > What I found by throwing in some very noisy trace statements was that,
> > > if the kworker is not woken up, the vhost just keeps looping accross
> > > all available queues (and seems to repeat itself). So it essentially
> > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > it will just keep on looping until it is migrated off the CPU.
> > 
> > 
> > Normally it takes the buffers off the queue and is done with it.
> > I am guessing that at the same time guest is running on some other
> > CPU and keeps adding available buffers?
> > 
> 
> It seems to do just that, there are multiple other vhost instances
> involved which might keep filling up thoses queues. 
> 

No vhost is ever only draining queues. Guest is filling them.

> Unfortunately, this makes the problematic vhost instance to stay on
> the CPU and prevents said kworker to get scheduled. The kworker is
> explicitly woken up by vhost, so it wants it to do something.
> 
> At this point it seems that there is an assumption about the scheduler
> in place which is no longer fulfilled by EEVDF. From the discussion so
> far, it seems like EEVDF does what is intended to do.
> 
> Shouldn't there be a more explicit mechanism in use that allows the
> kworker to be scheduled in favor of the vhost?
> 
> It is also concerning that the vhost seems cannot be preempted by the
> scheduler while executing that loop.


Which loop is that, exactly?

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-08 10:31                       ` Re: " Michael S. Tsirkin
@ 2023-12-08 11:41                         ` Tobias Huschle
       [not found]                         ` <53044.123120806415900549@us-mta-342.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-12-08 11:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Abel Wu, Peter Zijlstra, Linux Kernel, kvm, virtualization,
	netdev, jasowang

On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > 
> > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > someone can verify/correct. Please bear with me probably messing up 
> > > > the terminology.
> > > > 
> > > > - vhost is looping through available queues.
> > > > - vhost wants to wake up a kworker to process a found queue.
> > > > - kworker does something with that queue and terminates quickly.
> > > > 
> > > > What I found by throwing in some very noisy trace statements was that,
> > > > if the kworker is not woken up, the vhost just keeps looping accross
> > > > all available queues (and seems to repeat itself). So it essentially
> > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > it will just keep on looping until it is migrated off the CPU.
> > > 
> > > 
> > > Normally it takes the buffers off the queue and is done with it.
> > > I am guessing that at the same time guest is running on some other
> > > CPU and keeps adding available buffers?
> > > 
> > 
> > It seems to do just that, there are multiple other vhost instances
> > involved which might keep filling up thoses queues. 
> > 
> 
> No vhost is ever only draining queues. Guest is filling them.
> 
> > Unfortunately, this makes the problematic vhost instance to stay on
> > the CPU and prevents said kworker to get scheduled. The kworker is
> > explicitly woken up by vhost, so it wants it to do something.
> > 
> > At this point it seems that there is an assumption about the scheduler
> > in place which is no longer fulfilled by EEVDF. From the discussion so
> > far, it seems like EEVDF does what is intended to do.
> > 
> > Shouldn't there be a more explicit mechanism in use that allows the
> > kworker to be scheduled in favor of the vhost?
> > 
> > It is also concerning that the vhost seems cannot be preempted by the
> > scheduler while executing that loop.
> 
> 
> Which loop is that, exactly?

The loop continuously passes through translate_desc in drivers/vhost/vhost.c.
That's where I put the trace statements.

The overall sequence seems to be (top to bottom):

handle_rx
get_rx_bufs
vhost_get_vq_desc
vhost_get_avail_head
vhost_get_avail
__vhost_get_user_slow
translate_desc               << trace statement in here
vhost_iotlb_itree_first

These functions show up as having increased overhead in perf.

There are multiple loops going on in there.
Again the disclaimer though: I'm not familiar with that code at all.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-08  9:24                     ` Tobias Huschle
@ 2023-12-08 17:28                       ` Mike Christie
  0 siblings, 0 replies; 58+ messages in thread
From: Mike Christie @ 2023-12-08 17:28 UTC (permalink / raw)
  To: Tobias Huschle, Michael S. Tsirkin
  Cc: Abel Wu, Peter Zijlstra, Linux Kernel, kvm, virtualization,
	netdev, jasowang

On 12/8/23 3:24 AM, Tobias Huschle wrote:
> On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
>> On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
>>> 3. vhost looping endlessly, waiting for kworker to be scheduled
>>>
>>> I dug a little deeper on what the vhost is doing. I'm not an expert on
>>> virtio whatsoever, so these are just educated guesses that maybe
>>> someone can verify/correct. Please bear with me probably messing up 
>>> the terminology.
>>>
>>> - vhost is looping through available queues.
>>> - vhost wants to wake up a kworker to process a found queue.
>>> - kworker does something with that queue and terminates quickly.
>>>
>>> What I found by throwing in some very noisy trace statements was that,
>>> if the kworker is not woken up, the vhost just keeps looping accross
>>> all available queues (and seems to repeat itself). So it essentially
>>> relies on the scheduler to schedule the kworker fast enough. Otherwise
>>> it will just keep on looping until it is migrated off the CPU.
>>
>>
>> Normally it takes the buffers off the queue and is done with it.
>> I am guessing that at the same time guest is running on some other
>> CPU and keeps adding available buffers?
>>
> 
> It seems to do just that, there are multiple other vhost instances
> involved which might keep filling up thoses queues. 
> 
> Unfortunately, this makes the problematic vhost instance to stay on
> the CPU and prevents said kworker to get scheduled. The kworker is
> explicitly woken up by vhost, so it wants it to do something.
> 
> At this point it seems that there is an assumption about the scheduler
> in place which is no longer fulfilled by EEVDF. From the discussion so
> far, it seems like EEVDF does what is intended to do.
> 
> Shouldn't there be a more explicit mechanism in use that allows the
> kworker to be scheduled in favor of the vhost?
> 
> It is also concerning that the vhost seems cannot be preempted by the
> scheduler while executing that loop.
> 

Hey,

I recently noticed this change:

commit 05bfb338fa8dd40b008ce443e397fc374f6bd107
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Fri Feb 24 08:50:01 2023 -0800

    vhost: Fix livepatch timeouts in vhost_worker()

We used to do:

while (1)
	for each vhost work item in list
		execute work item
		if (need_resched())
                	schedule();

and after that patch we do:

while (1)
	for each vhost work item in list
		execute work item
		cond_resched()


Would the need_resched check we used to have give you what
you wanted?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                         ` <53044.123120806415900549@us-mta-342.us.mimecast.lan>
@ 2023-12-09 10:42                           ` Michael S. Tsirkin
  2023-12-11  7:26                             ` Jason Wang
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-09 10:42 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Abel Wu, Peter Zijlstra, Linux Kernel, kvm, virtualization,
	netdev, jasowang

On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > 
> > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > someone can verify/correct. Please bear with me probably messing up 
> > > > > the terminology.
> > > > > 
> > > > > - vhost is looping through available queues.
> > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > - kworker does something with that queue and terminates quickly.
> > > > > 
> > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > if the kworker is not woken up, the vhost just keeps looping accross
> > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > it will just keep on looping until it is migrated off the CPU.
> > > > 
> > > > 
> > > > Normally it takes the buffers off the queue and is done with it.
> > > > I am guessing that at the same time guest is running on some other
> > > > CPU and keeps adding available buffers?
> > > > 
> > > 
> > > It seems to do just that, there are multiple other vhost instances
> > > involved which might keep filling up thoses queues. 
> > > 
> > 
> > No vhost is ever only draining queues. Guest is filling them.
> > 
> > > Unfortunately, this makes the problematic vhost instance to stay on
> > > the CPU and prevents said kworker to get scheduled. The kworker is
> > > explicitly woken up by vhost, so it wants it to do something.
> > > 
> > > At this point it seems that there is an assumption about the scheduler
> > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > far, it seems like EEVDF does what is intended to do.
> > > 
> > > Shouldn't there be a more explicit mechanism in use that allows the
> > > kworker to be scheduled in favor of the vhost?
> > > 
> > > It is also concerning that the vhost seems cannot be preempted by the
> > > scheduler while executing that loop.
> > 
> > 
> > Which loop is that, exactly?
> 
> The loop continously passes translate_desc in drivers/vhost/vhost.c
> That's where I put the trace statements.
> 
> The overall sequence seems to be (top to bottom):
> 
> handle_rx
> get_rx_bufs
> vhost_get_vq_desc
> vhost_get_avail_head
> vhost_get_avail
> __vhost_get_user_slow
> translate_desc               << trace statement in here
> vhost_iotlb_itree_first

I wonder why do you keep missing cache and re-translating.
Is pr_debug enabled for you? If not could you check if it
outputs anything?
Or you can tweak:

#define vq_err(vq, fmt, ...) do {                                  \
                pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
                if ((vq)->error_ctx)                               \
                                eventfd_signal((vq)->error_ctx, 1);\
        } while (0)

to do pr_err if you prefer.

> These functions show up as having increased overhead in perf.
> 
> There are multiple loops going on in there.
> Again the disclaimer though, I'm not familiar with that code at all.


So there's a limit there: vhost_exceeds_weight should requeue work:

        } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));

then we invoke scheduler each time before re-executing it:


{       
        struct vhost_worker *worker = data;
        struct vhost_work *work, *work_next;
        struct llist_node *node;
        
        node = llist_del_all(&worker->work_list);
        if (node) {
                __set_current_state(TASK_RUNNING);

                node = llist_reverse_order(node);
                /* make sure flag is seen after deletion */
                smp_wmb();
                llist_for_each_entry_safe(work, work_next, node, node) {
                        clear_bit(VHOST_WORK_QUEUED, &work->flags);
                        kcov_remote_start_common(worker->kcov_handle);
                        work->fn(work);
                        kcov_remote_stop();
                        cond_resched();
                }
        }

        return !!node;
}       

These are the byte and packet limits:

/* Max number of bytes transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

/* Max number of packets transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others with small
 * pkts.
 */
#define VHOST_NET_PKT_WEIGHT 256


Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-09 10:42                           ` Michael S. Tsirkin
@ 2023-12-11  7:26                             ` Jason Wang
  2023-12-11 16:53                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Wang @ 2023-12-11  7:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tobias Huschle, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Sat, Dec 9, 2023 at 6:42 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> > On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > >
> > > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > > the terminology.
> > > > > >
> > > > > > - vhost is looping through available queues.
> > > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > > - kworker does something with that queue and terminates quickly.
> > > > > >
> > > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > > if the kworker is not woken up, the vhost just keeps looping accross
> > > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > > it will just keep on looping until it is migrated off the CPU.
> > > > >
> > > > >
> > > > > Normally it takes the buffers off the queue and is done with it.
> > > > > I am guessing that at the same time guest is running on some other
> > > > > CPU and keeps adding available buffers?
> > > > >
> > > >
> > > > It seems to do just that, there are multiple other vhost instances
> > > > involved which might keep filling up thoses queues.
> > > >
> > >
> > > No vhost is ever only draining queues. Guest is filling them.
> > >
> > > > Unfortunately, this makes the problematic vhost instance to stay on
> > > > the CPU and prevents said kworker to get scheduled. The kworker is
> > > > explicitly woken up by vhost, so it wants it to do something.

It looks to me like vhost doesn't use a workqueue but its own worker instead.

> > > >
> > > > At this point it seems that there is an assumption about the scheduler
> > > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > > far, it seems like EEVDF does what is intended to do.
> > > >
> > > > Shouldn't there be a more explicit mechanism in use that allows the
> > > > kworker to be scheduled in favor of the vhost?

Vhost does a bunch of copy_from_user() calls which should trigger
__might_fault() and thus a __might_sleep() in most cases.

> > > >
> > > > It is also concerning that the vhost seems cannot be preempted by the
> > > > scheduler while executing that loop.
> > >
> > >
> > > Which loop is that, exactly?
> >
> > The loop continously passes translate_desc in drivers/vhost/vhost.c
> > That's where I put the trace statements.
> >
> > The overall sequence seems to be (top to bottom):
> >
> > handle_rx
> > get_rx_bufs
> > vhost_get_vq_desc
> > vhost_get_avail_head
> > vhost_get_avail
> > __vhost_get_user_slow
> > translate_desc               << trace statement in here
> > vhost_iotlb_itree_first
>
> I wonder why do you keep missing cache and re-translating.
> Is pr_debug enabled for you? If not could you check if it
> outputs anything?
> Or you can tweak:
>
> #define vq_err(vq, fmt, ...) do {                                  \
>                 pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
>                 if ((vq)->error_ctx)                               \
>                                 eventfd_signal((vq)->error_ctx, 1);\
>         } while (0)
>
> to do pr_err if you prefer.
>
> > These functions show up as having increased overhead in perf.
> >
> > There are multiple loops going on in there.
> > Again the disclaimer though, I'm not familiar with that code at all.
>
>
> So there's a limit there: vhost_exceeds_weight should requeue work:
>
>         } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
>
> then we invoke scheduler each time before re-executing it:
>
>
> {
>         struct vhost_worker *worker = data;
>         struct vhost_work *work, *work_next;
>         struct llist_node *node;
>
>         node = llist_del_all(&worker->work_list);
>         if (node) {
>                 __set_current_state(TASK_RUNNING);
>
>                 node = llist_reverse_order(node);
>                 /* make sure flag is seen after deletion */
>                 smp_wmb();
>                 llist_for_each_entry_safe(work, work_next, node, node) {
>                         clear_bit(VHOST_WORK_QUEUED, &work->flags);
>                         kcov_remote_start_common(worker->kcov_handle);
>                         work->fn(work);
>                         kcov_remote_stop();
>                         cond_resched();
>                 }
>         }
>
>         return !!node;
> }
>
> These are the byte and packet limits:
>
> /* Max number of bytes transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> /* Max number of packets transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others with small
>  * pkts.
>  */
> #define VHOST_NET_PKT_WEIGHT 256
>
>
> Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

Or a dirty hack like cond_resched() in translate_desc().

Thanks


>
> --
> MST
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-11  7:26                             ` Jason Wang
@ 2023-12-11 16:53                               ` Michael S. Tsirkin
  2023-12-12  3:00                                 ` Jason Wang
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-11 16:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tobias Huschle, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?
> 
> Or a dirty hack like cond_resched() in translate_desc().

what do you mean, exactly?

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-11 16:53                               ` Michael S. Tsirkin
@ 2023-12-12  3:00                                 ` Jason Wang
  2023-12-12 16:15                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Jason Wang @ 2023-12-12  3:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tobias Huschle, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?
> >
> > Or a dirty hack like cond_resched() in translate_desc().
>
> what do you mean, exactly?

Ideally it should not matter, but Tobias said there's an unexpectedly
long time spent in translate_desc(), which may indicate that
might_sleep() or the like doesn't work for some reason.

Thanks

>
> --
> MST
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-12  3:00                                 ` Jason Wang
@ 2023-12-12 16:15                                   ` Michael S. Tsirkin
  2023-12-13 10:37                                     ` Tobias Huschle
       [not found]                                     ` <42870.123121305373200110@us-mta-641.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-12 16:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tobias Huschle, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?
> > >
> > > Or a dirty hack like cond_resched() in translate_desc().
> >
> > what do you mean, exactly?
> 
> Ideally it should not matter, but Tobias said there's an unexpectedly
> long time spent on translate_desc() which may indicate that the
> might_sleep() or other doesn't work for some reason.
> 
> Thanks

You mean for debugging, add it with a patch to see what this does?

Sure - can you post the debugging patch pls?

> >
> > --
> > MST
> >


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-12 16:15                                   ` Michael S. Tsirkin
@ 2023-12-13 10:37                                     ` Tobias Huschle
       [not found]                                     ` <42870.123121305373200110@us-mta-641.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-12-13 10:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:

We played around with the suggestions and some other ideas.
I would like to share some initial results.

We tried the following:

1. Call unconditional schedule in the vhost_worker function
2. Change the HZ value from 100 to 1000
3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
4. Adding a cond_resched to translate_desc
5. Reducing VHOST_NET_WEIGHT to 25% of its original value

Please find the diffs below.

Summary:

Option 1 is very very hacky but resolved the regression.
Option 2 reduces the regression by ~20%.
Options 3-5 do not help unfortunately.

Potential explanation:

While the vhost is executing, the need_resched flag is not set (observable
in the traces). Therefore cond_resched and the like will do nothing. vhost
will continue executing until the need_resched flag is set by an external
party, e.g. by a request to migrate the vhost.

Calling schedule unconditionally forces the scheduler to re-evaluate all
tasks and their vruntime/deadline/vlag values. The scheduler comes to the
correct conclusion that the kworker should be executed, and from there it
is smooth sailing. I will have to verify that sequence by collecting more
traces, but this seems rather plausible.
This hack might of course introduce all kinds of side effects but might
provide an indicator that this is the actual problem.
The big question would be how to solve this conceptually, and, first
things first, whether you think this is a viable hypothesis.

Increasing the HZ value most likely helps because the other CPUs make
scheduling/load balancing decisions more often as well and therefore
trigger the migration faster.

Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
vhost loop. But I have no intuition how low we can/should go here.


We also changed vq_err to print error messages, but did not encounter any.

Diffs:
--------------------------------------------------------------------------

1. Call unconditional schedule in the vhost_worker function

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..16d73fd28831 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
                }
        }
 
+       schedule();
        return !!node;
 }

--------------------------------------------------------------------------

2. Change the HZ value from 100 to 1000

--> config change 
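
For reference, that is just the usual Kconfig switch, no code change:

    # CONFIG_HZ_100 is not set
    CONFIG_HZ_1000=y
    CONFIG_HZ=1000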

--------------------------------------------------------------------------

3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..d519d598ebb9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -410,7 +410,8 @@ static bool vhost_worker(void *data)
                        kcov_remote_start_common(worker->kcov_handle);
                        work->fn(work);
                        kcov_remote_stop();
-                       cond_resched();
+                       if (need_resched())
+                               schedule();
                }
        }

--------------------------------------------------------------------------

4. Adding a cond_resched to translate_desc

I just picked some location.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..f885dd29cbd1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2367,6 +2367,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
                s += size;
                addr += size;
                ++ret;
+               cond_resched();
        }
 
        if (ret == -EAGAIN)

--------------------------------------------------------------------------

5. Reducing VHOST_NET_WEIGHT to 25% of its original value

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..2c6966ea6229 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,7 +42,7 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
-#define VHOST_NET_WEIGHT 0x80000
+#define VHOST_NET_WEIGHT 0x20000
 
 /* Max number of packets transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others with small

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                     ` <42870.123121305373200110@us-mta-641.us.mimecast.lan>
@ 2023-12-13 12:00                                       ` Michael S. Tsirkin
  2023-12-13 12:45                                         ` Tobias Huschle
       [not found]                                         ` <25485.123121307454100283@us-mta-18.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-13 12:00 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> We played around with the suggestions and some other ideas.
> I would like to share some initial results.
> 
> We tried the following:
> 
> 1. Call uncondtional schedule in the vhost_worker function
> 2. Change the HZ value from 100 to 1000
> 3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
> 4. Adding a cond_resched to translate_desc
> 5. Reducing VHOST_NET_WEIGHT to 25% of its original value
> 
> Please find the diffs below.
> 
> Summary:
> 
> Option 1 is very very hacky but resolved the regression.
> Option 2 reduces the regression by ~20%.
> Options 3-5 do not help unfortunately.
> 
> Potential explanation:
> 
> While the vhost is executing, the need_resched flag is not set (observable
> in the traces). Therefore cond_resched and alike will do nothing. vhost
> will continue executing until the need_resched flag is set by an external
> party, e.g. by a request to migrate the vhost.
> 
> Calling schedule unconditionally forces the scheduler to re-evaluate all 
> tasks and their vruntime/deadline/vlag values. The scheduler comes to the
> correct conclusion, that the kworker should be executed and from there it
> is smooth sailing. I will have to verify that sequence by collecting more
> traces, but this seems rather plausible.
> This hack might of course introduce all kinds of side effects but might
> provide an indicator that this is the actual problem.
> The big question would be how to solve this conceptually, and, first
> things first, whether you think this is a viable hypothesis.
> 
> Increasing the HZ value helps most likely because the other CPUs take 
> scheduling/load balancing decisions more often as well and therefore
> trigger the migration faster.
> 
> Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
> vhost loop. But I have no intuition how low we can/should go here.
> 
> 
> We also changed vq_err to print error messages, but did not encounter any.
> 
> Diffs:
> --------------------------------------------------------------------------
> 
> 1. Call uncondtional schedule in the vhost_worker function
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index e0c181ad17e3..16d73fd28831 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
>                 }
>         }
>  
> +       schedule();
>         return !!node;
>  }


So, this helps.
But this is very surprising!


static int vhost_task_fn(void *data)
{
        struct vhost_task *vtsk = data;
        bool dead = false;

        for (;;) {
                bool did_work;

                if (!dead && signal_pending(current)) {
                        struct ksignal ksig;
                        /*
                         * Calling get_signal will block in SIGSTOP,
                         * or clear fatal_signal_pending, but remember
                         * what was set.
                         *
                         * This thread won't actually exit until all
                         * of the file descriptors are closed, and
                         * the release function is called.
                         */
                        dead = get_signal(&ksig);
                        if (dead)
                                clear_thread_flag(TIF_SIGPENDING);
                }

                /* mb paired w/ vhost_task_stop */
                set_current_state(TASK_INTERRUPTIBLE);

                if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
                        __set_current_state(TASK_RUNNING);
                        break;
                }

                did_work = vtsk->fn(vtsk->data);
                if (!did_work)
                        schedule();
        }

        complete(&vtsk->exited);
        do_exit(0);

}

Apparently schedule is already called?


-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-13 12:00                                       ` Michael S. Tsirkin
@ 2023-12-13 12:45                                         ` Tobias Huschle
       [not found]                                         ` <25485.123121307454100283@us-mta-18.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2023-12-13 12:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:

[...]
> 
> Apparently schedule is already called?
> 

What about this: 

static int vhost_task_fn(void *data)
{
	<...>
	did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm not mistaken
	if (!did_work)
		schedule();
	<...>
}

static bool vhost_worker(void *data)
{
	struct vhost_worker *worker = data;
	struct vhost_work *work, *work_next;
	struct llist_node *node;

	node = llist_del_all(&worker->work_list);
	if (node) {
		<...>
		llist_for_each_entry_safe(work, work_next, node, node) {
			<...>
		}
	}

	return !!node;
}

The llist_for_each_entry_safe does not actually change the node value, does it?

If it does not change it, !!node would evaluate to 1,
thereby skipping the schedule.

This was changed recently with:
f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

It returned a hardcoded 0 before. The commit message explicitly mentions this
change to make vhost_worker return 1 if it did something.

Seems indeed like a nasty little side effect caused by EEVDF not scheduling
the woken up kworker right away.
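
Putting the two snippets from above together, the path I suspect looks
like this (simplified):

	node = llist_del_all(&worker->work_list);  /* non-empty on every pass */
	/* ... process the work items ... */
	return !!node;                             /* -> did_work == true     */

	/* back in vhost_task_fn: */
	if (!did_work)
		schedule();                        /* hence never reached     */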

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                         ` <25485.123121307454100283@us-mta-18.us.mimecast.lan>
@ 2023-12-13 14:47                                           ` Michael S. Tsirkin
  2023-12-13 14:55                                           ` Michael S. Tsirkin
  1 sibling, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-13 14:47 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev, Mike Christie

On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
> 	<...>
> 	did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm not mistaken
> 	if (!did_work)
> 		schedule();
> 	<...>
> }
> 
> static bool vhost_worker(void *data)
> {
> 	struct vhost_worker *worker = data;
> 	struct vhost_work *work, *work_next;
> 	struct llist_node *node;
> 
> 	node = llist_del_all(&worker->work_list);
> 	if (node) {
> 		<...>
> 		llist_for_each_entry_safe(work, work_next, node, node) {
> 			<...>
> 		}
> 	}
> 
> 	return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value, doesn't it?
> 
> If it does not change it, !!node would return 1.
> Thereby skipping the schedule.
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.

Indeed, but previously vhost_worker was looping itself.
And it did:
-               node = llist_del_all(&worker->work_list);
-               if (!node)
-                       schedule();

so I don't think this was changed at all.






-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                         ` <25485.123121307454100283@us-mta-18.us.mimecast.lan>
  2023-12-13 14:47                                           ` Michael S. Tsirkin
@ 2023-12-13 14:55                                           ` Michael S. Tsirkin
  2023-12-14  7:14                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-13 14:55 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
> 	<...>
> 	did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm not mistaken
> 	if (!did_work)
> 		schedule();
> 	<...>
> }
> 
> static bool vhost_worker(void *data)
> {
> 	struct vhost_worker *worker = data;
> 	struct vhost_work *work, *work_next;
> 	struct llist_node *node;
> 
> 	node = llist_del_all(&worker->work_list);
> 	if (node) {
> 		<...>
> 		llist_for_each_entry_safe(work, work_next, node, node) {
> 			<...>
> 		}
> 	}
> 
> 	return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value, doesn't it?
> 
> If it does not change it, !!node would return 1.
> Thereby skipping the schedule.
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.


So we are actually making an effort to be nice.
Documentation/kernel-hacking/hacking.rst says:

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

    cond_resched(); /* Will sleep */


and this is what vhost.c does.

At this point I'm not sure why it's appropriate to call schedule() as opposed to
cond_resched(). Ideas?
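
(To spell out the difference as I understand it, in very simplified terms
cond_resched() amounts to

	if (need_resched()) {	/* only if TIF_NEED_RESCHED was already set */
		schedule();
		return 1;
	}
	return 0;		/* otherwise: no-op */

so it only yields if something already set need_resched for us, while a
bare schedule() always lets the scheduler re-pick. This is simplified; the
real implementation differs per preemption model.)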


-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-13 14:55                                           ` Michael S. Tsirkin
@ 2023-12-14  7:14                                             ` Michael S. Tsirkin
  2024-01-08 13:13                                               ` Tobias Huschle
       [not found]                                               ` <92916.124010808133201076@us-mta-622.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-12-14  7:14 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Wed, Dec 13, 2023 at 09:55:23AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> > On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > [...]
> > > 
> > > Apparently schedule is already called?
> > > 
> > 
> > What about this: 
> > 
> > static int vhost_task_fn(void *data)
> > {
> > 	<...>
> > 	did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm not mistaken
> > 	if (!did_work)
> > 		schedule();
> > 	<...>
> > }
> > 
> > static bool vhost_worker(void *data)
> > {
> > 	struct vhost_worker *worker = data;
> > 	struct vhost_work *work, *work_next;
> > 	struct llist_node *node;
> > 
> > 	node = llist_del_all(&worker->work_list);
> > 	if (node) {
> > 		<...>
> > 		llist_for_each_entry_safe(work, work_next, node, node) {
> > 			<...>
> > 		}
> > 	}
> > 
> > 	return !!node;
> > }
> > 
> > The llist_for_each_entry_safe does not actually change the node value, doesn't it?
> > 
> > If it does not change it, !!node would return 1.
> > Thereby skipping the schedule.
> > 
> > This was changed recently with:
> > f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> > 
> > It returned a hardcoded 0 before. The commit message explicitly mentions this
> > change to make vhost_worker return 1 if it did something.
> > 
> > Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> > the woken up kworker right away.
> 
> 
> So we are actually making an effort to be nice.
> Documentation/kernel-hacking/hacking.rst says:
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
>     cond_resched(); /* Will sleep */
> 
> 
> and this is what vhost.c does.
> 
> At this point I'm not sure why it's appropriate to call schedule() as opposed to
> cond_resched(). Ideas?
> 

Peter, would appreciate feedback on this. When is cond_resched()
insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
be updated to require schedule() instead?


> -- 
> MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2023-12-14  7:14                                             ` Michael S. Tsirkin
@ 2024-01-08 13:13                                               ` Tobias Huschle
       [not found]                                               ` <92916.124010808133201076@us-mta-622.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-01-08 13:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> 
> Peter, would appreciate feedback on this. When is cond_resched()
> insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> be updated to require schedule() instead?
> 

Happy new year everybody!

I'd like to bring this thread back to life. To reiterate:

- The introduction of the EEVDF scheduler revealed a performance
  regression in a uperf testcase of ~50%.
- Tracing the scheduler showed that it takes decisions which are
  in line with its design.
- The traces also showed that a vhost instance might run
  excessively long on its CPU in some circumstances. These long runs
  cause the performance regression, as they result in delays of 100+ ms
  for a kworker which drives the actual network processing.
- Before EEVDF, the vhost would always be scheduled off its CPU
  in favor of the kworker, as the kworker was being woken up and
  the former scheduler was giving more priority to the woken up
  task. With EEVDF, the kworker, as a long running process, is
  able to accumulate negative lag, which causes EEVDF to not
  prefer it on its wake up, leaving the vhost running.
- If the kworker is not scheduled when being woken up, the vhost
  continues looping until it is migrated off the CPU.
- The vhost offers to be scheduled off the CPU by calling
  cond_resched(), but, as the need_resched flag is not set,
  cond_resched() does nothing.

To solve this, I see the following options
  (this might be neither a complete nor a correct list)
- Along with the wakeup of the kworker, need_resched needs to
  be set, such that cond_resched() triggers a reschedule
  (see the rough sketch at the end of this mail).
- The vhost calls schedule() instead of cond_resched() to give up
  the CPU. This would of course be a significantly stricter
  approach and might limit the performance of vhost in other cases.
- Preventing the kworker from accumulating negative lag as it is
  mostly not runnable and if it runs, it only runs for a very short
  time frame. This might clash with the overall concept of EEVDF.
- On cond_resched(), verify whether the consumed runtime of the caller
  outweighs the negative lag of another process (e.g. the
  kworker) and, if so, schedule the other process. This introduces
  overhead to cond_resched.

I would be curious on feedback on those ideas and interested in
alternative approaches.
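
To illustrate the first option, here is a very rough sketch of the idea
(not a proposed patch; the function names and exact call site are from
memory and may well be off):

	/* in the fair wakeup path, e.g. around the pick_eevdf() check:
	 * even if EEVDF decides not to preempt because the woken task
	 * carries negative lag, mark the current task so that its next
	 * cond_resched() actually gives up the CPU.
	 */
	if (pick_eevdf(cfs_rq) != pse) {
		resched_curr(rq);
		return;
	}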

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                               ` <92916.124010808133201076@us-mta-622.us.mimecast.lan>
@ 2024-01-09 23:07                                                 ` Michael S. Tsirkin
  2024-01-21 18:44                                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-01-09 23:07 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > Peter, would appreciate feedback on this. When is cond_resched()
> > insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> > be updated to require schedule() instead?
> > 
> 
> Happy new year everybody!
> 
> I'd like to bring this thread back to life. To reiterate:
> 
> - The introduction of the EEVDF scheduler revealed a performance
>   regression in a uperf testcase of ~50%.
> - Tracing the scheduler showed that it takes decisions which are
>   in line with its design.
> - The traces showed as well, that a vhost instance might run
>   excessively long on its CPU in some circumstance. Those cause
>   the performance regression as they cause delay times of 100+ms
>   for a kworker which drives the actual network processing.
> - Before EEVDF, the vhost would always be scheduled off its CPU
>   in favor of the kworker, as the kworker was being woken up and
>   the former scheduler was giving more priority to the woken up
>   task. With EEVDF, the kworker, as a long running process, is
>   able to accumulate negative lag, which causes EEVDF to not
>   prefer it on its wake up, leaving the vhost running.
> - If the kworker is not scheduled when being woken up, the vhost
>   continues looping until it is migrated off the CPU.
> - The vhost offers to be scheduled off the CPU by calling 
>   cond_resched(), but, the the need_resched flag is not set,
>   therefore cond_resched() does nothing.
> 
> To solve this, I see the following options 
>   (might not be a complete nor a correct list)
> - Along with the wakeup of the kworker, need_resched needs to
>   be set, such that cond_resched() triggers a reschedule.
> - The vhost calls schedule() instead of cond_resched() to give up
>   the CPU. This would of course be a significantly stricter
>   approach and might limit the performance of vhost in other cases.

And on these two, I asked:
	Would appreciate feedback on this. When is cond_resched()
	insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
	be updated to require schedule() instead?


> - Preventing the kworker from accumulating negative lag as it is
>   mostly not runnable and if it runs, it only runs for a very short
>   time frame. This might clash with the overall concept of EEVDF.
> - On cond_resched(), verify if the consumed runtime of the caller
>   is outweighing the negative lag of another process (e.g. the 
>   kworker) and schedule the other process. Introduces overhead
>   to cond_resched.
> 
> I would be curious on feedback on those ideas and interested in
> alternative approaches.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                               ` <92916.124010808133201076@us-mta-622.us.mimecast.lan>
  2024-01-09 23:07                                                 ` Michael S. Tsirkin
@ 2024-01-21 18:44                                                 ` Michael S. Tsirkin
  2024-01-22 11:29                                                   ` Tobias Huschle
                                                                     ` (2 more replies)
  1 sibling, 3 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-01-21 18:44 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > Peter, would appreciate feedback on this. When is cond_resched()
> > insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> > be updated to require schedule() instead?
> > 
> 
> Happy new year everybody!
> 
> I'd like to bring this thread back to life. To reiterate:
> 
> - The introduction of the EEVDF scheduler revealed a performance
>   regression in a uperf testcase of ~50%.
> - Tracing the scheduler showed that it takes decisions which are
>   in line with its design.
> - The traces showed as well, that a vhost instance might run
>   excessively long on its CPU in some circumstance. Those cause
>   the performance regression as they cause delay times of 100+ms
>   for a kworker which drives the actual network processing.
> - Before EEVDF, the vhost would always be scheduled off its CPU
>   in favor of the kworker, as the kworker was being woken up and
>   the former scheduler was giving more priority to the woken up
>   task. With EEVDF, the kworker, as a long running process, is
>   able to accumulate negative lag, which causes EEVDF to not
>   prefer it on its wake up, leaving the vhost running.
> - If the kworker is not scheduled when being woken up, the vhost
>   continues looping until it is migrated off the CPU.
> - The vhost offers to be scheduled off the CPU by calling 
>   cond_resched(), but, the the need_resched flag is not set,
>   therefore cond_resched() does nothing.
> 
> To solve this, I see the following options 
>   (might not be a complete nor a correct list)
> - Along with the wakeup of the kworker, need_resched needs to
>   be set, such that cond_resched() triggers a reschedule.

Let's try this? Does not look like discussing vhost itself will
draw attention from scheduler guys but posting a scheduling
patch probably will? Can you post a patch?

> - The vhost calls schedule() instead of cond_resched() to give up
>   the CPU. This would of course be a significantly stricter
>   approach and might limit the performance of vhost in other cases.
> - Preventing the kworker from accumulating negative lag as it is
>   mostly not runnable and if it runs, it only runs for a very short
>   time frame. This might clash with the overall concept of EEVDF.
> - On cond_resched(), verify if the consumed runtime of the caller
>   is outweighing the negative lag of another process (e.g. the 
>   kworker) and schedule the other process. Introduces overhead
>   to cond_resched.

Or this last one.


> 
> I would be curious on feedback on those ideas and interested in
> alternative approaches.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-01-21 18:44                                                 ` Michael S. Tsirkin
@ 2024-01-22 11:29                                                   ` Tobias Huschle
  2024-02-01  7:38                                                   ` Tobias Huschle
       [not found]                                                   ` <07974.124020102385100135@us-mta-501.us.mimecast.lan>
  2 siblings, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-01-22 11:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > Peter, would appreciate feedback on this. When is cond_resched()
> > > insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> > > be updated to require schedule() instead?
> > > 
> > 
> > Happy new year everybody!
> > 
> > I'd like to bring this thread back to life. To reiterate:
> > 
> > - The introduction of the EEVDF scheduler revealed a performance
> >   regression in a uperf testcase of ~50%.
> > - Tracing the scheduler showed that it takes decisions which are
> >   in line with its design.
> > - The traces showed as well, that a vhost instance might run
> >   excessively long on its CPU in some circumstance. Those cause
> >   the performance regression as they cause delay times of 100+ms
> >   for a kworker which drives the actual network processing.
> > - Before EEVDF, the vhost would always be scheduled off its CPU
> >   in favor of the kworker, as the kworker was being woken up and
> >   the former scheduler was giving more priority to the woken up
> >   task. With EEVDF, the kworker, as a long running process, is
> >   able to accumulate negative lag, which causes EEVDF to not
> >   prefer it on its wake up, leaving the vhost running.
> > - If the kworker is not scheduled when being woken up, the vhost
> >   continues looping until it is migrated off the CPU.
> > - The vhost offers to be scheduled off the CPU by calling 
> >   cond_resched(), but, the the need_resched flag is not set,
> >   therefore cond_resched() does nothing.
> > 
> > To solve this, I see the following options 
> >   (might not be a complete nor a correct list)
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
> 
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?
> 

I'll give it a go.

> > - The vhost calls schedule() instead of cond_resched() to give up
> >   the CPU. This would of course be a significantly stricter
> >   approach and might limit the performance of vhost in other cases.
> > - Preventing the kworker from accumulating negative lag as it is
> >   mostly not runnable and if it runs, it only runs for a very short
> >   time frame. This might clash with the overall concept of EEVDF.
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the 
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
> 
> Or this last one.
> 

This one will probably be more complicated as the necessary information
is not really available at the places where I'd like to see it.
Will have to ponder on that a bit to figure out if there might be an
elegant way to approach this.

> 
> > 
> > I would be curious on feedback on those ideas and interested in
> > alternative approaches.
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-01-21 18:44                                                 ` Michael S. Tsirkin
  2024-01-22 11:29                                                   ` Tobias Huschle
@ 2024-02-01  7:38                                                   ` Tobias Huschle
       [not found]                                                   ` <07974.124020102385100135@us-mta-501.us.mimecast.lan>
  2 siblings, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-02-01  7:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
> 
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?

As a baseline, I verified that the following two options fix
the regression:

- replacing the cond_resched in the vhost_worker function with a hard
  schedule 
- setting the need_resched flag using set_tsk_need_resched(current)
  right before calling cond_resched
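
For illustration only, a minimal sketch of what the two baseline variants
amount to in a vhost-worker-style loop. This is not the actual
vhost_worker() body; the loop shape, the next_pending_work() helper and
the work/worker variables are made up for the example, only schedule(),
cond_resched(), set_tsk_need_resched() and current are real kernel APIs:

    /* Variant 1: give up the CPU unconditionally after each work item. */
    while ((work = next_pending_work(worker)) != NULL) {
            work->fn(work);
            schedule();
    }

    /* Variant 2: keep cond_resched(), but first mark the current task
     * as needing a reschedule so that cond_resched() actually acts. */
    while ((work = next_pending_work(worker)) != NULL) {
            work->fn(work);
            set_tsk_need_resched(current);
            cond_resched();
    }

Both variants force a pass through the scheduler after every work item,
which is what gives the woken-up kworker a chance to be considered.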

I then tried to find a better spot to put the set_tsk_need_resched
call. 

One approach I found to work is setting the need_resched flag
at the end of handle_tx and handle_rx.
This would be after data has been actually passed to the socket, so 
the originally blocked kworker has something to do and will profit
from the reschedule. 
It might be possible to go deeper and place the set_tsk_need_resched
call right after actually passing the data, but this
might leave us with sprinkling that call in multiple places and
might be too intrusive.
Furthermore, it might be possible to check if an error occurred when
preparing the transmission and then skip the setting of the flag.

This would require a conceptual decision on the vhost side.
This solution would not touch the scheduler, only incentivise it to
do the right thing for this particular regression.

Another idea could be to find the counterpart that initiates the
actual data transfer, which I assume wakes up the kworker. From
what I gather it seems to be an eventfd notification that ends up
somewhere in the qemu code. Not sure if that context would allow
setting the need_resched flag, nor whether this would be a good idea.

> 
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the 
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
> 
> Or this last one.

On cond_resched itself, this will probably only be possible in a very 
very hacky way. That is because, currently, there is no immediate access
to the necessary data available, which would make it necessary to 
bloat up the cond_resched function quite a bit, with a probably 
non-negligible amount of overhead.

Changing other aspects in the scheduler might get us in trouble as
they all would probably resolve back to the question "What is the magic
value that determines whether a small task not being scheduled justifies
setting the need_resched flag for a currently running task or adjusting 
its lag?". As this would then also have to work for all non-vhost related
cases, this looks like a dangerous path to me on second thought.


-------- Summary --------

In my (non-vhost experience) opinion the way to go would be either
replacing the cond_resched with a hard schedule or setting the
need_resched flag within vhost if a data transfer was successfully
initiated. It will be necessary to check if this causes problems with
other workloads/benchmarks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                                   ` <07974.124020102385100135@us-mta-501.us.mimecast.lan>
@ 2024-02-01  8:08                                                     ` Michael S. Tsirkin
  2024-02-01 11:47                                                       ` Tobias Huschle
       [not found]                                                       ` <89460.124020106474400877@us-mta-475.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-02-01  8:08 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > - Along with the wakeup of the kworker, need_resched needs to
> > >   be set, such that cond_resched() triggers a reschedule.
> > 
> > Let's try this? Does not look like discussing vhost itself will
> > draw attention from scheduler guys but posting a scheduling
> > patch probably will? Can you post a patch?
> 
> As a baseline, I verified that the following two options fix
> the regression:
> 
> - replacing the cond_resched in the vhost_worker function with a hard
>   schedule 
> - setting the need_resched flag using set_tsk_need_resched(current)
>   right before calling cond_resched
> 
> I then tried to find a better spot to put the set_tsk_need_resched
> call. 
> 
> One approach I found to be working is setting the need_resched flag 
> at the end of handle_tx and hande_rx.
> This would be after data has been actually passed to the socket, so 
> the originally blocked kworker has something to do and will profit
> from the reschedule. 
> It might be possible to go deeper and place the set_tsk_need_resched
> call to the location right after actually passing the data, but this
> might leave us with sprinkling that call in multiple places and
> might be too intrusive.
> Furthermore, it might be possible to check if an error occured when
> preparing the transmission and then skip the setting of the flag.
> 
> This would require a conceptual decision on the vhost side.
> This solution would not touch the scheduler, only incentivise it to
> do the right thing for this particular regression.
> 
> Another idea could be to find the counterpart that initiates the
> actual data transfer, which I assume wakes up the kworker. From
> what I gather it seems to be an eventfd notification that ends up
> somewhere in the qemu code. Not sure if that context would allow
> to set the need_resched flag, nor whether this would be a good idea.
> 
> > 
> > > - On cond_resched(), verify if the consumed runtime of the caller
> > >   is outweighing the negative lag of another process (e.g. the 
> > >   kworker) and schedule the other process. Introduces overhead
> > >   to cond_resched.
> > 
> > Or this last one.
> 
> On cond_resched itself, this will probably only be possible in a very 
> very hacky way. That is because currently, there is no immidiate access
> to the necessary data available, which would make it necessary to 
> bloat up the cond_resched function quite a bit, with a probably 
> non-negligible amount of overhead.
> 
> Changing other aspects in the scheduler might get us in trouble as
> they all would probably resolve back to the question "What is the magic
> value that determines whether a small task not being scheduled justifies
> setting the need_resched flag for a currently running task or adjusting 
> its lag?". As this would then also have to work for all non-vhost related
> cases, this looks like a dangerous path to me on second thought.
> 
> 
> -------- Summary --------
> 
> In my (non-vhost experience) opinion the way to go would be either
> replacing the cond_resched with a hard schedule or setting the
> need_resched flag within vhost if the a data transfer was successfully
> initiated. It will be necessary to check if this causes problems with
> other workloads/benchmarks.

Yes but conceptually I am still in the dark on whether the fact that
periodically invoking cond_resched is no longer sufficient to be nice to
others is a bug, or intentional.  So you feel it is intentional?
I propose a two patch series then:

patch 1: in this text in Documentation/kernel-hacking/hacking.rst

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

    cond_resched(); /* Will sleep */


replace cond_resched -> schedule


Since apparently cond_resched is no longer sufficient to
make the scheduler check whether you need to give up the CPU.

patch 2: make this change for vhost.

WDYT?

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-02-01  8:08                                                     ` Michael S. Tsirkin
@ 2024-02-01 11:47                                                       ` Tobias Huschle
       [not found]                                                       ` <89460.124020106474400877@us-mta-475.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-02-01 11:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > -------- Summary --------
> > 
> > In my (non-vhost experience) opinion the way to go would be either
> > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if the a data transfer was successfully
> > initiated. It will be necessary to check if this causes problems with
> > other workloads/benchmarks.
> 
> Yes but conceptually I am still in the dark on whether the fact that
> periodically invoking cond_resched is no longer sufficient to be nice to
> others is a bug, or intentional.  So you feel it is intentional?

I would assume that cond_resched is still a valid concept.
But, in this particular scenario we have the following problem:

So far (with CFS) we had:
1. vhost initiates data transfer
2. kworker is woken up
3. CFS gives priority to woken up task and schedules it
4. kworker runs

Now (with EEVDF) we have:
0. In some cases, kworker has accumulated negative lag 
1. vhost initiates data transfer
2. kworker is woken up
-3a. EEVDF does not schedule kworker if it has negative lag
-4a. vhost continues running, kworker on same CPU starves
--
-3b. EEVDF schedules kworker if it has positive or no lag
-4b. kworker runs

In the 3a/4a case, the kworker is given no chance to set the
necessary flag. The flag can only be set by another CPU now.
The schedule of the kworker was not caused by cond_resched, but
rather by the wakeup path of the scheduler.

cond_resched works successfully once the load balancer (I suppose) 
decides to migrate the vhost off to another CPU. In that case, the
load balancer on another CPU sets that flag and we are good.
That then eventually allows the scheduler to pick kworker, but very
late.
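
As a side note on why cond_resched "does nothing" in the 3a/4a case: it
only gives up the CPU if the need_resched flag has already been set by
someone else (the tick, wakeup preemption, the load balancer, ...).
Ignoring preempt-count and preemption-model details, it behaves roughly
like this simplified sketch (not the exact kernel implementation):

    static int cond_resched_sketch(void)
    {
            if (!test_tsk_need_resched(current))
                    return 0;       /* flag not set -> keep running */
            schedule();             /* flag set -> give up the CPU  */
            return 1;
    }

So as long as EEVDF does not pick the woken kworker and nothing else on
this CPU sets the flag, vhost can call cond_resched as often as it wants
without ever rescheduling.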

> I propose a two patch series then:
> 
> patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
>     cond_resched(); /* Will sleep */
> 
> 
> replace cond_resched -> schedule
> 
> 
> Since apparently cond_resched is no longer sufficient to
> make the scheduler check whether you need to give up the CPU.
> 
> patch 2: make this change for vhost.
> 
> WDYT?

For patch 1, I would like to see some feedback from Peter (or someone else
from the scheduler maintainers).
For patch 2, I would prefer to do some more testing first if this might have
a negative effect on other benchmarks.

I also stumbled upon something in the scheduler code that I want to verify.
Maybe a cgroup thing, will check that out again.

I'll do some more testing with the cond_resched->schedule fix, check the
cgroup thing and wait for Peter then.
Will get back if any of the above yields some results.

> 
> -- 
> MST
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                                       ` <89460.124020106474400877@us-mta-475.us.mimecast.lan>
@ 2024-02-01 12:08                                                         ` Michael S. Tsirkin
  2024-02-22 19:23                                                         ` Michael S. Tsirkin
  2024-03-11 17:05                                                         ` Michael S. Tsirkin
  2 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-02-01 12:08 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > -------- Summary --------
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if the a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

I don't really understand what is special about vhost though.
Wouldn't it apply to any kernel code?

> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> >     cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).

I am guessing once you post it you will see feedback.

> For patch 2, I would prefer to do some more testing first if this might have
> an negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                                       ` <89460.124020106474400877@us-mta-475.us.mimecast.lan>
  2024-02-01 12:08                                                         ` Michael S. Tsirkin
@ 2024-02-22 19:23                                                         ` Michael S. Tsirkin
  2024-03-11 17:05                                                         ` Michael S. Tsirkin
  2 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-02-22 19:23 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.

As I predicted, if you want attention from sched guys you need to
send a patch in their area.

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                                       ` <89460.124020106474400877@us-mta-475.us.mimecast.lan>
  2024-02-01 12:08                                                         ` Michael S. Tsirkin
  2024-02-22 19:23                                                         ` Michael S. Tsirkin
@ 2024-03-11 17:05                                                         ` Michael S. Tsirkin
  2024-03-12  9:45                                                           ` Luis Machado
  2 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-03-11 17:05 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev

On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > -------- Summary --------
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if the a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

Are we going anywhere with this btw?


> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> >     cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).
> For patch 2, I would prefer to do some more testing first if this might have
> an negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-11 17:05                                                         ` Michael S. Tsirkin
@ 2024-03-12  9:45                                                           ` Luis Machado
  2024-03-14 11:46                                                             ` Tobias Huschle
       [not found]                                                             ` <73123.124031407552500165@us-mta-156.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Luis Machado @ 2024-03-12  9:45 UTC (permalink / raw)
  To: Michael S. Tsirkin, Tobias Huschle
  Cc: Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel, kvm,
	virtualization, netdev, nd

On 3/11/24 17:05, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
>> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
>>> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
>>>> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
>>>>>> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
>>>>
>>>> -------- Summary --------
>>>>
>>>> In my (non-vhost experience) opinion the way to go would be either
>>>> replacing the cond_resched with a hard schedule or setting the
>>>> need_resched flag within vhost if the a data transfer was successfully
>>>> initiated. It will be necessary to check if this causes problems with
>>>> other workloads/benchmarks.
>>>
>>> Yes but conceptually I am still in the dark on whether the fact that
>>> periodically invoking cond_resched is no longer sufficient to be nice to
>>> others is a bug, or intentional.  So you feel it is intentional?
>>
>> I would assume that cond_resched is still a valid concept.
>> But, in this particular scenario we have the following problem:
>>
>> So far (with CFS) we had:
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> 3. CFS gives priority to woken up task and schedules it
>> 4. kworker runs
>>
>> Now (with EEVDF) we have:
>> 0. In some cases, kworker has accumulated negative lag 
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> -3a. EEVDF does not schedule kworker if it has negative lag
>> -4a. vhost continues running, kworker on same CPU starves
>> --
>> -3b. EEVDF schedules kworker if it has positive or no lag
>> -4b. kworker runs
>>
>> In the 3a/4a case, the kworker is given no chance to set the
>> necessary flag. The flag can only be set by another CPU now.
>> The schedule of the kworker was not caused by cond_resched, but
>> rather by the wakeup path of the scheduler.
>>
>> cond_resched works successfully once the load balancer (I suppose) 
>> decides to migrate the vhost off to another CPU. In that case, the
>> load balancer on another CPU sets that flag and we are good.
>> That then eventually allows the scheduler to pick kworker, but very
>> late.
> 
> Are we going anywhere with this btw?
> 
>

I think Tobias had a couple other threads related to this, with other potential fixes:

https://lore.kernel.org/lkml/20240228161018.14253-1-huschle@linux.ibm.com/

https://lore.kernel.org/lkml/20240228161023.14310-1-huschle@linux.ibm.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-12  9:45                                                           ` Luis Machado
@ 2024-03-14 11:46                                                             ` Tobias Huschle
       [not found]                                                             ` <73123.124031407552500165@us-mta-156.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-03-14 11:46 UTC (permalink / raw)
  To: Luis Machado
  Cc: Michael S. Tsirkin, Jason Wang, Abel Wu, Peter Zijlstra,
	Linux Kernel, kvm, virtualization, netdev, nd

On Tue, Mar 12, 2024 at 09:45:57AM +0000, Luis Machado wrote:
> On 3/11/24 17:05, Michael S. Tsirkin wrote:
> > 
> > Are we going anywhere with this btw?
> > 
> >
> 
> I think Tobias had a couple other threads related to this, with other potential fixes:
> 
> https://lore.kernel.org/lkml/20240228161018.14253-1-huschle@linux.ibm.com/
> 
> https://lore.kernel.org/lkml/20240228161023.14310-1-huschle@linux.ibm.com/
> 

Sorry, Michael, should have provided those threads here as well.

The more I look into this issue, the more things I find to ponder upon.
It seems like this issue can (maybe) be fixed on the scheduler side after all.

The root cause of this regression remains that the mentioned kworker gets
a negative lag value and is therefore not eligible to run on wake up.
This negative lag is potentially assigned incorrectly. But I'm not sure yet.

Anytime I find something that can address the symptom, there is a potential
root cause on another level, and I would like to avoid just addressing a
symptom to fix the issue, whereas it would be better to find the actual
root cause.

I would nevertheless still argue that vhost relies rather heavily on the fact
that the kworker gets scheduled on wake up every time. But I don't have a
proposal at hand that accounts for potential side effects if opting for
explicitly initiating a schedule.
Maybe the assumption that said kworker should always be selected on wake
up is valid. In that case the explicit schedule would merely be a safety 
net.

I will let you know if something comes up on the scheduler side. There are
some more ideas on my side on how this could be approached.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                                             ` <73123.124031407552500165@us-mta-156.us.mimecast.lan>
@ 2024-03-14 15:09                                                               ` Michael S. Tsirkin
  2024-03-15  8:33                                                                 ` Tobias Huschle
       [not found]                                                                 ` <84704.124031504335801509@us-mta-515.us.mimecast.lan>
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-03-14 15:09 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd

On Thu, Mar 14, 2024 at 12:46:54PM +0100, Tobias Huschle wrote:
> On Tue, Mar 12, 2024 at 09:45:57AM +0000, Luis Machado wrote:
> > On 3/11/24 17:05, Michael S. Tsirkin wrote:
> > > 
> > > Are we going anywhere with this btw?
> > > 
> > >
> > 
> > I think Tobias had a couple other threads related to this, with other potential fixes:
> > 
> > https://lore.kernel.org/lkml/20240228161018.14253-1-huschle@linux.ibm.com/
> > 
> > https://lore.kernel.org/lkml/20240228161023.14310-1-huschle@linux.ibm.com/
> > 
> 
> Sorry, Michael, should have provided those threads here as well.
> 
> The more I look into this issue, the more things to ponder upon I find.
> It seems like this issue can (maybe) be fixed on the scheduler side after all.
> 
> The root cause of this regression remains that the mentioned kworker gets
> a negative lag value and is therefore not elligible to run on wake up.
> This negative lag is potentially assigned incorrectly. But I'm not sure yet.
> 
> Anytime I find something that can address the symptom, there is a potential
> root cause on another level, and I would like to avoid to just address a
> symptom to fix the issue, wheras it would be better to find the actual
> root cause.
> 
> I would nevertheless still argue, that vhost relies rather heavily on the fact
> that the kworker gets scheduled on wake up everytime. But I don't have a 
> proposal at hand that accounts for potential side effects if opting for
> explicitly initiating a schedule.
> Maybe the assumption, that said kworker should always be selected on wake 
> up is valid. In that case the explicit schedule would merely be a safety 
> net.
> 
> I will let you know if something comes up on the scheduler side. There are
> some more ideas on my side how this could be approached.

Thanks a lot! To clarify it is not that I am opposed to changing vhost.
I would like however for some documentation to exist saying that if you
do abc then call API xyz. Then I hope we can feel a bit safer that
future scheduler changes will not break vhost (though as usual, nothing
is for sure).  Right now we are going by the documentation and that says
cond_resched so we do that.

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-14 15:09                                                               ` Michael S. Tsirkin
@ 2024-03-15  8:33                                                                 ` Tobias Huschle
       [not found]                                                                 ` <84704.124031504335801509@us-mta-515.us.mimecast.lan>
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-03-15  8:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd

On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> 
> Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> I would like however for some documentation to exist saying that if you
> do abc then call API xyz. Then I hope we can feel a bit safer that
> future scheduler changes will not break vhost (though as usual, nothing
> is for sure).  Right now we are going by the documentation and that says
> cond_resched so we do that.
> 
> -- 
> MST
> 

Here I'd like to add that we have two different problems:

1. cond_resched not working as expected
   This appears to me to be a bug in the scheduler where it lets the cgroup, 
   which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
   is allowed to surpass its own deadline without consequences. One of my RFCs
   mentioned above addresses this issue (not happy yet with the implementation).
   This issue only appears in that specific scenario, so it's not a general 
   issue, rather a corner case.
   But, this fix will still allow the vhost to reach its deadline, which is
   one full time slice. This brings down the max delays from 300+ms to whatever
   the timeslice is. This is not enough to fix the regression.

2. vhost relying on kworker being scheduled on wake up
   This is the bigger issue for the regression. There are rare cases where
   the vhost runs only for a very short amount of time before it wakes up 
   the kworker. Simultaneously, the kworker takes longer than usual to 
   complete its work and takes longer than the vhost did before. We
   are talking 4-digit to low 5-digit nanosecond values.
   With those two being the only tasks on the CPU, the scheduler now assumes
   that the kworker wants to unfairly consume more than the vhost and denies
   it being scheduled on wakeup.
   In the regular cases, the kworker is faster than the vhost, so the 
   scheduler assumes that the kworker needs help, which benefits the
   scenario we are looking at.
   In the bad case, this unfortunately means that cond_resched cannot work
   as well as before, for this particular case!
   So, let's assume that problem 1 from above is fixed. It will take one 
   full time slice to get the need_resched flag set by the scheduler
   because vhost surpasses its deadline. Until then, the scheduler cannot know
   that the kworker should actually run. The kworker itself is unable
   to communicate that by itself since it's not getting scheduled and there 
   is no external entity that could intervene.
   Hence my argumentation that cond_resched still works as expected. The
   crucial part is that the wake up behavior has changed which is why I'm 
   a bit reluctant to propose a documentation change on cond_resched.
   I could see proposing a doc change saying that cond_resched should not be
   used if a task relies heavily on a woken-up task being scheduled.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
       [not found]                                                                 ` <84704.124031504335801509@us-mta-515.us.mimecast.lan>
@ 2024-03-15 10:31                                                                   ` Michael S. Tsirkin
  2024-03-19  8:21                                                                     ` Tobias Huschle
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-03-15 10:31 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd

On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > 
> > Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> > I would like however for some documentation to exist saying that if you
> > do abc then call API xyz. Then I hope we can feel a bit safer that
> > future scheduler changes will not break vhost (though as usual, nothing
> > is for sure).  Right now we are going by the documentation and that says
> > cond_resched so we do that.
> > 
> > -- 
> > MST
> > 
> 
> Here I'd like to add that we have two different problems:
> 
> 1. cond_resched not working as expected
>    This appears to me to be a bug in the scheduler where it lets the cgroup, 
>    which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
>    is allowed to surpass its own deadline without consequences. One of my RFCs
>    mentioned above adresses this issue (not happy yet with the implementation).
>    This issue only appears in that specific scenario, so it's not a general 
>    issue, rather a corner case.
>    But, this fix will still allow the vhost to reach its deadline, which is
>    one full time slice. This brings down the max delays from 300+ms to whatever
>    the timeslice is. This is not enough to fix the regression.
> 
> 2. vhost relying on kworker being scheduled on wake up
>    This is the bigger issue for the regression. There are rare cases, where
>    the vhost runs only for a very short amount of time before it wakes up 
>    the kworker. Simultaneously, the kworker takes longer than usual to 
>    complete its work and takes longer than the vhost did before. We
>    are talking 4digit to low 5digit nanosecond values.
>    With those two being the only tasks on the CPU, the scheduler now assumes
>    that the kworker wants to unfairly consume more than the vhost and denies
>    it being scheduled on wakeup.
>    In the regular cases, the kworker is faster than the vhost, so the 
>    scheduler assumes that the kworker needs help, which benefits the
>    scenario we are looking at.
>    In the bad case, this means unfortunately, that cond_resched cannot work
>    as good as before, for this particular case!
>    So, let's assume that problem 1 from above is fixed. It will take one 
>    full time slice to get the need_resched flag set by the scheduler
>    because vhost surpasses its deadline. Before, the scheduler cannot know
>    that the kworker should actually run. The kworker itself is unable
>    to communicate that by itself since it's not getting scheduled and there 
>    is no external entity that could intervene.
>    Hence my argumentation that cond_resched still works as expected. The
>    crucial part is that the wake up behavior has changed which is why I'm 
>    a bit reluctant to propose a documentation change on cond_resched.
>    I could see proposing a doc change, that cond_resched should not be
>    used if a task heavily relies on a woken up task being scheduled.

Could you remind me pls, what is the kworker doing specifically that
vhost is relying on?

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-15 10:31                                                                   ` Michael S. Tsirkin
@ 2024-03-19  8:21                                                                     ` Tobias Huschle
  2024-03-19  8:29                                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 58+ messages in thread
From: Tobias Huschle @ 2024-03-19  8:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd

On 2024-03-15 11:31, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
>> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
>> >
> 
> Could you remind me pls, what is the kworker doing specifically that
> vhost is relying on?

The kworker is handling the actual data movement in memory, if I'm not
mistaken.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-19  8:21                                                                     ` Tobias Huschle
@ 2024-03-19  8:29                                                                       ` Michael S. Tsirkin
  2024-03-19  8:59                                                                         ` Tobias Huschle
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  8:29 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd

On Tue, Mar 19, 2024 at 09:21:06AM +0100, Tobias Huschle wrote:
> On 2024-03-15 11:31, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> > > On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > > >
> > 
> > Could you remind me pls, what is the kworker doing specifically that
> > vhost is relying on?
> 
> The kworker is handling the actual data moving in memory if I'm not
> mistaking.

I think that is the vhost process itself. Maybe you mean the
guest thread versus the vhost thread then?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-19  8:29                                                                       ` Michael S. Tsirkin
@ 2024-03-19  8:59                                                                         ` Tobias Huschle
  2024-04-30 10:50                                                                           ` Tobias Huschle
  0 siblings, 1 reply; 58+ messages in thread
From: Tobias Huschle @ 2024-03-19  8:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd

On 2024-03-19 09:29, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 09:21:06AM +0100, Tobias Huschle wrote:
>> On 2024-03-15 11:31, Michael S. Tsirkin wrote:
>> > On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
>> > > On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
>> > > >
>> >
>> > Could you remind me pls, what is the kworker doing specifically that
>> > vhost is relying on?
>> 
>> The kworker is handling the actual data moving in memory if I'm not
>> mistaking.
> 
> I think that is the vhost process itself. Maybe you mean the
> guest thread versus the vhost thread then?

My understanding was that vhost writes data into a file descriptor which 
then triggers eventfd.

That's at least how I read the vhost code if I remember correctly.

The handler beneath (the kworker) then runs the actual instructions that 
move the data to the receiving vhost on the other end of the connection.

Again, I might be wrong here.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-03-19  8:59                                                                         ` Tobias Huschle
@ 2024-04-30 10:50                                                                           ` Tobias Huschle
  2024-05-01 10:51                                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Tobias Huschle @ 2024-04-30 10:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Luis Machado, Jason Wang, Abel Wu, Peter Zijlstra, Linux Kernel,
	kvm, virtualization, netdev, nd, borntraeger

It took me a while, but I was able to figure out why EEVDF behaves
differently than CFS does. I'm still waiting for some official confirmation
of my assumptions but it all seems very plausible to me.

Leaving aside all the specifics of vhost and kworkers, a more general
description of the scenario would be as follows:

Assume that we have two tasks taking turns on a single CPU. 
Task 1 does something and wakes up Task 2.
Task 2 does something and goes to sleep.
And we're just repeating that.
Task 1 and task 2 only run for very short amounts of time, i.e. much 
shorter than a regular time slice (vhost = task1, kworker = task2).

Let's further assume that task 1 runs longer than task 2.
In CFS, this means that the vruntime of task 1 starts to outrun the vruntime
of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
always gets picked on wake up because it has the smaller vruntime. 
In EEVDF, this would translate to a permanent positive lag, which also 
causes task 2 to get consistently scheduled on wake up.

Let's now assume that, occasionally, task 2 runs a little bit longer than
task 1. In CFS, this means that task 2 can close the vruntime gap by a
bit, but it can easily remain below the value of task 1. Task 2 would
still get picked on wake up.
With EEVDF, in its current form, task 2 will now get a negative lag, which,
in turn, will cause it not to be picked on the next wake up.

So, it seems we have a change in how far the two variants look
into the past: CFS is willing to take more history into account, whereas
EEVDF does not (with update_entity_lag setting the lag value from scratch, 
and place_entity not taking the original vruntime into account).

All of this can be seen as correct by design: a task consumes more time
than the others, so it has to give way to others. The big difference
now is that CFS allowed a task to collect some bonus by constantly using
less CPU time than others and trading that time against occasionally taking
more CPU time. EEVDF could do the same thing by allowing the accumulation
of positive lag, which can then be traded against the one time the task
would get negative lag. This might clash with other EEVDF assumptions though.

The patch below fixes the degradation; it is not at all aligned with what
EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
correct.

So, what does this now mean for the vhost regression we were discussing?

1. The behavior of the scheduler changed with regard to wake-up scenarios.
2. vhost in its current form relies on the way CFS works by assuming
   that the kworker always gets scheduled.

I would like to argue that it therefore makes sense to reconsider the vhost
implementation to make it less dependent on the internals of the scheduler.
As proposed earlier in this thread, I see two options:

1. Do an explicit schedule() after every iteration across the vhost queues
2. Set the need_resched flag after writing to the socket, which triggers
   the eventfd and thereby the underlying kworker

Both options would make sure that the vhost gives up the CPU as it cannot
continue anyway without the kworker handling the event. Option 1 will give
up the CPU regardless of whether something was found in the queues, whereas
option 2 would only give up the CPU if there was.

It should be noted that we encountered similar behavior when running some
fio benchmarks. From a brief glance at the code, I saw similar
intentions: loop over queues, then trigger an action through some event
mechanism. Applying the same patch as mentioned above also fixes this issue.

It could be argued that this is still something that needs to be somehow
addressed by the scheduler since it might affect others as well and there 
are in fact patches coming in. Will they address our issue here? Not sure yet.
On the other hand, it might just be beneficial to make vhost more resilient
towards the scheduler's algorithm by not relying on a certain behavior in
the wakeup path.
Further discussion on additional commits to make EEVDF work correctly can 
be found here: 
https://lore.kernel.org/lkml/20240408090639.GD21904@noisy.programming.kicks-ass.net/T/
So far these patches do not fix the degradation.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
        s64 lag, limit;
 
        SCHED_WARN_ON(!se->on_rq);
-       lag = avg_vruntime(cfs_rq) - se->vruntime;
+       lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
        limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
        se->vlag = clamp(lag, -limit, limit);
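
As a toy illustration of what the modified line computes (numbers made
up): assume the kworker's previous dequeue left it with se->vlag =
+9000 ns, and that after its one unusually long run we now have
avg_vruntime(cfs_rq) - se->vruntime = -3000 ns. The current code sets
vlag = -3000, so the kworker is placed ineligible at its next wakeup;
with the change it becomes +9000 - 3000 = +6000, i.e. the bonus
accumulated earlier is traded against the single long run and the
kworker is still picked (clamping to the +/- limit ignored here).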


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-04-30 10:50                                                                           ` Tobias Huschle
@ 2024-05-01 10:51                                                                             ` Peter Zijlstra
  2024-05-01 15:31                                                                               ` Michael S. Tsirkin
  2024-05-02 12:20                                                                               ` Tobias Huschle
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2024-05-01 10:51 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Michael S. Tsirkin, Luis Machado, Jason Wang, Abel Wu,
	Linux Kernel, kvm, virtualization, netdev, nd, borntraeger,
	Ingo Molnar, Mike Galbraith

On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
> It took me a while, but I was able to figure out why EEVDF behaves 
> different then CFS does. I'm still waiting for some official confirmation
> of my assumptions but it all seems very plausible to me.
> 
> Leaving aside all the specifics of vhost and kworkers, a more general
> description of the scenario would be as follows:
> 
> Assume that we have two tasks taking turns on a single CPU. 
> Task 1 does something and wakes up Task 2.
> Task 2 does something and goes to sleep.
> And we're just repeating that.
> Task 1 and task 2 only run for very short amounts of time, i.e. much 
> shorter than a regular time slice (vhost = task1, kworker = task2).
> 
> Let's further assume, that task 1 runs longer than task 2. 
> In CFS, this means, that vruntime of task 1 starts to outrun the vruntime
> of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
> always gets picked on wake up because it has the smaller vruntime. 
> In EEVDF, this would translate to a permanent positive lag, which also 
> causes task 2 to get consistently scheduled on wake up.
> 
> Let's now assume, that ocassionally, task 2 runs a little bit longer than
> task 1. In CFS, this means, that task 2 can close the vruntime gap by a
> bit, but, it can easily remain below the value of task 1. Task 2 would 
> still get picked on wake up.
> With EEVDF, in its current form, task 2 will now get a negative lag, which
> in turn, will cause it not being picked on the next wake up.

Right, so I've been working on changes where tasks will be able to
'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
runqueue will allow them to burn off negative lag. Once they get picked
again they are guaranteed to have zero (or more) lag. If by that time
they've not been woken up again, they get dequeued with 0-lag.

(placement with 0-lag will ensure eligibility doesn't inhibit the pick,
but is not sufficient to ensure a pick)

However, this alone will not be sufficient to get the behaviour you
want. Notably, even at 0-lag the virtual deadline will still be after
the virtual deadline of the already running task -- assuming they have
equal request sizes.

That is, IIUC, you want your task 2 (kworker) to always preempt task 1
(vhost), right? So even if task 2 were to have 0-lag, placing it would
be something like:

t1      |---------<    
t2        |---------<
V    -----|-----------------------------

So t1 has started at | with a virtual deadline at <. Then a short
while later -- V will have advanced a little -- it wakes t2 with 0-lag,
but as you can observe, its virtual deadline will be later than t1's and
as such it will never get picked, even though they're both eligible.

> So, it seems we have a change in the level of how far the both variants look 
> into the past. CFS being willing to take more history into account, whereas
> EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> and place_entity not taking the original vruntime into account).
>
> All of this can be seen as correct by design, a task consumes more time
> than the others, so it has to give way to others. The big difference
> is now, that CFS allowed a task to collect some bonus by constantly using 
> less CPU time than others and trading that time against ocassionally taking
> more CPU time. EEVDF could do the same thing, by allowing the accumulation
> of positive lag, which can then be traded against the one time the task
> would get negative lag. This might clash with other EEVDF assumptions though.

Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
considers both virtual runtime (for eligibility, which ties to fairness)
but primarily virtual deadline (for timeliness).

If you want to make EEVDF force pick a task by modifying vruntime you
have to place it with lag > request (slice) such that the virtual
deadline of the newly placed task is before the already running task,
yielding both eligibility and earliest deadline.

Consistently placing tasks with such large (positive) lag will affect
fairness though, they're basically always runnable, so barring external
throttling, they'll starve you.
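
To make that concrete, a toy user-space illustration (made-up numbers;
single CPU, equal weights, vslice = 10; a task is eligible when
vruntime <= avg_vruntime and its virtual deadline is vruntime + vslice):

	#include <stdio.h>

	struct task { const char *name; long vruntime; long deadline; };

	/* pick the eligible task with the earliest virtual deadline */
	static const struct task *pick(const struct task *a,
				       const struct task *b, long avg)
	{
		int a_ok = a->vruntime <= avg;
		int b_ok = b->vruntime <= avg;

		if (a_ok && b_ok)
			return a->deadline <= b->deadline ? a : b;
		return a_ok ? a : b;
	}

	int main(void)
	{
		long V = 1000;				/* current avg_vruntime */
		/* t1 started slightly before V; deadline from back then */
		struct task t1 = { "t1/vhost",   1000, 1005 };
		/* t2 woken with 0-lag: placed at V, deadline = V + vslice */
		struct task t2 = { "t2/kworker", 1000, 1010 };

		printf("0-lag wakeup:       %s\n", pick(&t1, &t2, V)->name);

		t2.vruntime = 985;			/* lag > vslice ... */
		t2.deadline = 995;			/* ... earliest deadline */
		printf("lag > slice wakeup: %s\n", pick(&t1, &t2, V)->name);
		return 0;
	}

With 0-lag, t2 is eligible but its deadline is later than t1's, so t1 keeps
running; only with lag > vslice does t2 become both eligible and the
earliest deadline.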

> The patch below fixes the degredation, but is not at all aligned with what 
> EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
> correct.
> 
> So, what does this now mean for the vhost regression we were discussing?
> 
> 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> 2. vhost in its current form relies on the way how CFS works by assuming 
>    that the kworker always gets scheduled.

How does it assume this? Also, this is a performance issue, not a
correctness issue, right?

> I would like to argue that it therefore makes sense to reconsider the vhost
> implementation to make it less dependent on the internals of the scheduler.

I think I'll propose the opposite :-) Many of the problems we have are
because the scheduler simply doesn't know anything and we're playing a
mutual guessing game.

The trick is finding things to tell the scheduler it can actually do
something with though..

> As proposed earlier in this thread, I see two options:
> 
> 1. Do an explicit schedule() after every iteration across the vhost queues
> 2. Set the need_resched flag after writing to the socket that would trigger
>    eventfd and the underlying kworker

Neither of these options will get you what you want. Specifically in the
example above, t1 doing an explicit reschedule will result in t1 being
picked.

> Both options would make sure that the vhost gives up the CPU as it cannot
> continue anyway without the kworker handling the event. Option 1 will give
> up the CPU regardless of whether something was found in the queues, whereas
> option 2 would only give up the CPU if there is.

Incorrect, neither schedule() nor marking things with TIF_NEED_RESCHED
(which has more issues) will make t2 run. In that scenario you have to
make t1 block, such that t2 is the only possible choice. As long as you
keep t1 on the runqueue, it will be the most eligible pick at that time.

Now, there is an easy option... but I hate to mention it because I've
spent a lifetime telling people not to use it (for really good reasons):
yield().

With EEVDF yield() will move the virtual deadline ahead by one request.
That is, given the above scenario:

t1      |---------<    
t2        |---------<
V    -----|-----------------------------

t1 doing yield(), would result in:

t1      |-------------------<    
t2        |---------<
V    -----|-----------------------------

And at that point, you'll find that all of a sudden t2 will be picked.
On the flip side, you might find that when t2 completes another task is
more likely to run than return to t1 -- because of that elongated
deadline. Ofc. if t1 and t2 are the only tasks on the CPU this doesn't
matter.

> It shall be noted, that we encountered similar behavior when running some
> fio benchmarks. From a brief glance at the code, I was seeing similar
> intentions: Loop over queues, then trigger an action through some event
> mechanism. Applying the same patch as mentioned above also fixes this issue.
> 
> It could be argued, that this is still something that needs to be somehow
> addressed by the scheduler since it might affect others as well and there 
> are in fact patches coming in. Will they address our issue here? Not sure yet.

> On the other hand, it might just be beneficial to make vhost more resilient
> towards the scheduler's algorithm by not relying on a certain behavior in
> the wakeup path.

So the 'advantage' of EEVDF over CFS is that it has 2 parameters to play
with: weight and slice. Slice being the new toy in town.

Specifically in your example you would ideally have task 2 have a
shorter slice. Except of course it's a kworker and you can't very well
set a kworker with a short slice because you never know wth it will end
up doing.

I'm still wondering why exactly it is imperative for t2 to preempt t1.
Is there some unexpressed serialization / spin-waiting ?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-05-01 10:51                                                                             ` Peter Zijlstra
@ 2024-05-01 15:31                                                                               ` Michael S. Tsirkin
  2024-05-02  9:16                                                                                 ` Peter Zijlstra
  2024-05-02 12:23                                                                                 ` Tobias Huschle
  2024-05-02 12:20                                                                               ` Tobias Huschle
  1 sibling, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2024-05-01 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tobias Huschle, Luis Machado, Jason Wang, Abel Wu, Linux Kernel,
	kvm, virtualization, netdev, nd, borntraeger, Ingo Molnar,
	Mike Galbraith

On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
> > It took me a while, but I was able to figure out why EEVDF behaves 
> > different then CFS does. I'm still waiting for some official confirmation
> > of my assumptions but it all seems very plausible to me.
> > 
> > Leaving aside all the specifics of vhost and kworkers, a more general
> > description of the scenario would be as follows:
> > 
> > Assume that we have two tasks taking turns on a single CPU. 
> > Task 1 does something and wakes up Task 2.
> > Task 2 does something and goes to sleep.
> > And we're just repeating that.
> > Task 1 and task 2 only run for very short amounts of time, i.e. much 
> > shorter than a regular time slice (vhost = task1, kworker = task2).
> > 
> > Let's further assume, that task 1 runs longer than task 2. 
> > In CFS, this means, that vruntime of task 1 starts to outrun the vruntime
> > of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
> > always gets picked on wake up because it has the smaller vruntime. 
> > In EEVDF, this would translate to a permanent positive lag, which also 
> > causes task 2 to get consistently scheduled on wake up.
> > 
> > Let's now assume, that ocassionally, task 2 runs a little bit longer than
> > task 1. In CFS, this means, that task 2 can close the vruntime gap by a
> > bit, but, it can easily remain below the value of task 1. Task 2 would 
> > still get picked on wake up.
> > With EEVDF, in its current form, task 2 will now get a negative lag, which
> > in turn, will cause it not being picked on the next wake up.
> 
> Right, so I've been working on changes where tasks will be able to
> 'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
> runqueue will allow them to burn off negative lag. Once they get picked
> again they are guaranteed to have zero (or more) lag. If by that time
> they've not been woken up again, they get dequeued with 0-lag.
> 
> (placement with 0-lag will ensure eligibility doesn't inhibit the pick,
> but is not sufficient to ensure a pick)
> 
> However, this alone will not be sufficient to get the behaviour you
> want. Notably, even at 0-lag the virtual deadline will still be after
> the virtual deadline of the already running task -- assuming they have
> equal request sizes.
> 
> That is, IIUC, you want your task 2 (kworker) to always preempt task 1
> (vhost), right? So even if tsak 2 were to have 0-lag, placing it would
> be something like:
> 
> t1      |---------<    
> t2        |---------<
> V    -----|-----------------------------
> 
> So t1 has started at | with a virtual deadline at <. Then a short
> while later -- V will have advanced a little -- it wakes t2 with 0-lag,
> but as you can observe, its virtual deadline will be later than t1's and
> as such it will never get picked, even though they're both eligible.
> 
> > So, it seems we have a change in the level of how far the both variants look 
> > into the past. CFS being willing to take more history into account, whereas
> > EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> > and place_entity not taking the original vruntime into account).
> >
> > All of this can be seen as correct by design, a task consumes more time
> > than the others, so it has to give way to others. The big difference
> > is now, that CFS allowed a task to collect some bonus by constantly using 
> > less CPU time than others and trading that time against ocassionally taking
> > more CPU time. EEVDF could do the same thing, by allowing the accumulation
> > of positive lag, which can then be traded against the one time the task
> > would get negative lag. This might clash with other EEVDF assumptions though.
> 
> Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
> considers both virtual runtime (for eligibility, which ties to fairness)
> but primarily virtual deadline (for timeliness).
> 
> If you want to make EEVDF force pick a task by modifying vruntime you
> have to place it with lag > request (slice) such that the virtual
> deadline of the newly placed task is before the already running task,
> yielding both eligibility and earliest deadline.
> 
> Consistently placing tasks with such large (positive) lag will affect
> fairness though, they're basically always runnable, so barring external
> throttling, they'll starve you.
> 
> > The patch below fixes the degredation, but is not at all aligned with what 
> > EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
> > correct.
> > 
> > So, what does this now mean for the vhost regression we were discussing?
> > 
> > 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> > 2. vhost in its current form relies on the way how CFS works by assuming 
> >    that the kworker always gets scheduled.
> 
> How does it assume this? Also, this is a performance issue, not a
> correctness issue, right?
> 
> > I would like to argue that it therefore makes sense to reconsider the vhost
> > implementation to make it less dependent on the internals of the scheduler.
> 
> I think I'll propose the opposite :-) Much of the problems we have are
> because the scheduler simply doesn't know anything and we're playing a
> mutual guessing game.
> 
> The trick is finding things to tell the scheduler it can actually do
> something with though..
> 
> > As proposed earlier in this thread, I see two options:
> > 
> > 1. Do an explicit schedule() after every iteration across the vhost queues
> > 2. Set the need_resched flag after writing to the socket that would trigger
> >    eventfd and the underlying kworker
> 
> Neither of these options will get you what you want. Specifically in the
> example above, t1 doing an explicit reschedule will result in t1 being
> picked.
> 
> > Both options would make sure that the vhost gives up the CPU as it cannot
> > continue anyway without the kworker handling the event. Option 1 will give
> > up the CPU regardless of whether something was found in the queues, whereas
> > option 2 would only give up the CPU if there is.
> 
> Incorrect, neither schedule() nor marking things with TIF_NEED_RESCHED
> (which has more issues) will make t2 run. In that scenario you have to
> make t1 block, such that t2 is the only possible choice. As long as you
> keep t1 on the runqueue, it will be the most eligible pick at that time.
> 
> Now, there is an easy option... but I hate to mention it because I've
> spend a lifetime telling people not to use it (for really good reasons):
> yield().
> 
> With EEVDF yield() will move the virtual deadline ahead by one request.
> That is, given the above scenario:
> 
> t1      |---------<    
> t2        |---------<
> V    -----|-----------------------------
> 
> t1 doing yield(), would result in:
> 
> t1      |-------------------<    
> t2        |---------<
> V    -----|-----------------------------
> 
> And at that point, you'll find that all of a sudden t2 will be picked.
> On the flip side, you might find that when t2 completes another task is
> more likely to run than return to t1 -- because of that elongated
> deadline. Ofc. if t1 and t2 are the only tasks on the CPU this doesn't
> matter.
> 
> > It shall be noted, that we encountered similar behavior when running some
> > fio benchmarks. From a brief glance at the code, I was seeing similar
> > intentions: Loop over queues, then trigger an action through some event
> > mechanism. Applying the same patch as mentioned above also fixes this issue.
> > 
> > It could be argued, that this is still something that needs to be somehow
> > addressed by the scheduler since it might affect others as well and there 
> > are in fact patches coming in. Will they address our issue here? Not sure yet.
> 
> > On the other hand, it might just be beneficial to make vhost more resilient
> > towards the scheduler's algorithm by not relying on a certain behavior in
> > the wakeup path.
> 
> So the 'advantage' of EEVDF over CFS is that it has 2 parameters to play
> with: weight and slice. Slice being the new toy in town.
> 
> Specifically in your example you would ideally have task 2 have a
> shorter slice. Except of course its a kworker and you can't very well
> set a kworker with a short slice because you never know wth it will end
> up doing.
> 
> I'm still wondering why exactly it is imperative for t2 to preempt t1.
> Is there some unexpressed serialization / spin-waiting ?


I am not sure but I think the point is that t2 is a kworker. It is
much cheaper to run it right now, when we are already in the kernel,
than to return to userspace, let it run for a bit, then interrupt it
and then run t2.
Right, Tobias?

-- 
MST


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-05-01 15:31                                                                               ` Michael S. Tsirkin
@ 2024-05-02  9:16                                                                                 ` Peter Zijlstra
  2024-05-02 12:23                                                                                 ` Tobias Huschle
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2024-05-02  9:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tobias Huschle, Luis Machado, Jason Wang, Abel Wu, Linux Kernel,
	kvm, virtualization, netdev, nd, borntraeger, Ingo Molnar,
	Mike Galbraith

On Wed, May 01, 2024 at 11:31:02AM -0400, Michael S. Tsirkin wrote:
> On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:

> > I'm still wondering why exactly it is imperative for t2 to preempt t1.
> > Is there some unexpressed serialization / spin-waiting ?
> 
> 
> I am not sure but I think the point is that t2 is a kworker. It is
> much cheaper to run it right now when we are already in the kernel
> than return to userspace, let it run for a bit then interrupt it
> and then run t2.
> Right, Tobias?

So that is fundamentally a consequence of using a kworker.

So I tried to have a quick peek at vhost to figure out why you're using
kworkers... but no luck :/

Also, when I look at drivers/vhost/ it seems to implement its own
worker and not use normal workqueues or even kthread_worker. Did we
really need yet another copy of all that?

Anyway, I tried to have a quick look at the code, but I can't seem to
get a handle on what and why it's doing things.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-05-01 10:51                                                                             ` Peter Zijlstra
  2024-05-01 15:31                                                                               ` Michael S. Tsirkin
@ 2024-05-02 12:20                                                                               ` Tobias Huschle
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-05-02 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael S. Tsirkin, Luis Machado, Jason Wang, Abel Wu,
	Linux Kernel, kvm, virtualization, netdev, nd, borntraeger,
	Ingo Molnar, Mike Galbraith

On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
<...>
> > 
> > Let's now assume, that ocassionally, task 2 runs a little bit longer than
> > task 1. In CFS, this means, that task 2 can close the vruntime gap by a
> > bit, but, it can easily remain below the value of task 1. Task 2 would 
> > still get picked on wake up.
> > With EEVDF, in its current form, task 2 will now get a negative lag, which
> > in turn, will cause it not being picked on the next wake up.
> 
> Right, so I've been working on changes where tasks will be able to
> 'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
> runqueue will allow them to burn off negative lag. Once they get picked
> again they are guaranteed to have zero (or more) lag. If by that time
> they've not been woken up again, they get dequeued with 0-lag.
> 
> (placement with 0-lag will ensure eligibility doesn't inhibit the pick,
> but is not sufficient to ensure a pick)
> 
> However, this alone will not be sufficient to get the behaviour you
> want. Notably, even at 0-lag the virtual deadline will still be after
> the virtual deadline of the already running task -- assuming they have
> equal request sizes.
> 
> That is, IIUC, you want your task 2 (kworker) to always preempt task 1
> (vhost), right? So even if tsak 2 were to have 0-lag, placing it would
> be something like:
> 
> t1      |---------<    
> t2        |---------<
> V    -----|-----------------------------

Exactly, the kworker should be picked. I experimented with that a bit as
well and forced all tasks to have 0-lag on wake-up, but got the results
you are mentioning here. Only if I gave the kworker (in general, all
woken-up tasks) a lag > 0, with 1 already being sufficient, would the
kworker be picked consistently.

> 
> So t1 has started at | with a virtual deadline at <. Then a short
> while later -- V will have advanced a little -- it wakes t2 with 0-lag,
> but as you can observe, its virtual deadline will be later than t1's and
> as such it will never get picked, even though they're both eligible.
> 
> > So, it seems we have a change in the level of how far the both variants look 
> > into the past. CFS being willing to take more history into account, whereas
> > EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> > and place_entity not taking the original vruntime into account).
> >
> > All of this can be seen as correct by design, a task consumes more time
> > than the others, so it has to give way to others. The big difference
> > is now, that CFS allowed a task to collect some bonus by constantly using 
> > less CPU time than others and trading that time against ocassionally taking
> > more CPU time. EEVDF could do the same thing, by allowing the accumulation
> > of positive lag, which can then be traded against the one time the task
> > would get negative lag. This might clash with other EEVDF assumptions though.
> 
> Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
> considers both virtual runtime (for eligibility, which ties to fairness)
> but primarily virtual deadline (for timeliness).
> 
> If you want to make EEVDF force pick a task by modifying vruntime you
> have to place it with lag > request (slice) such that the virtual
> deadline of the newly placed task is before the already running task,
> yielding both eligibility and earliest deadline.
> 
> Consistently placing tasks with such large (positive) lag will affect
> fairness though, they're basically always runnable, so barring external
> throttling, they'll starve you.

I was concerned about that as well. Tampering with the lag value will help
in this particular scenario but might cause problems with others.

> 
> > The patch below fixes the degredation, but is not at all aligned with what 
> > EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
> > correct.
> > 
> > So, what does this now mean for the vhost regression we were discussing?
> > 
> > 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> > 2. vhost in its current form relies on the way how CFS works by assuming 
> >    that the kworker always gets scheduled.
> 
> How does it assume this? Also, this is a performance issue, not a
> correctness issue, right?

vhost runs a while(true) loop to go over its queues. After each iteration it
runs cond_resched() to give other tasks a chance to run. So it will never be
preempted by the kworker; the kworker has no way to force that while vhost
is running in kernel space. This means that the wake-up path is the only
chance for the kworker to get selected.

So that assumption is of a very implicit nature.
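
For reference, cond_resched() only drops the CPU when the need-resched flag
is already set; roughly (heavily simplified from kernel/sched/core.c):

	int __sched __cond_resched(void)
	{
		if (should_resched(0)) {	/* need_resched set and
						 * preemption allowed? */
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}

and, following the analysis above, the wakeup of the kworker does not set
that flag, because the freshly placed kworker does not have the earliest
virtual deadline.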

In fact, vhost will run forever until migration hits, due to an issue that I
assume lies in the cgroup context. See here:
https://lore.kernel.org/all/20240228161023.14310-1-huschle@linux.ibm.com/
Fixing this issue still lets vhost consume its full time slice, which
still causes significant performance degradation though.

> 
> > I would like to argue that it therefore makes sense to reconsider the vhost
> > implementation to make it less dependent on the internals of the scheduler.
> 
> I think I'll propose the opposite :-) Much of the problems we have are
> because the scheduler simply doesn't know anything and we're playing a
> mutual guessing game.
> 
> The trick is finding things to tell the scheduler it can actually do
> something with though..

I appreciate hearing that adjusting the scheduler might be an option here.
Nevertheless, the implicit assumption mentioned above seems like something to
keep an eye on to me.

> 
> > As proposed earlier in this thread, I see two options:
> > 
> > 1. Do an explicit schedule() after every iteration across the vhost queues
> > 2. Set the need_resched flag after writing to the socket that would trigger
> >    eventfd and the underlying kworker
> 
> Neither of these options will get you what you want. Specifically in the
> example above, t1 doing an explicit reschedule will result in t1 being
> picked.
> 

In this particular scenario it actually helped. I had patches for both
variants and they eliminated the degradation. Maybe the schedule() was enough
to equalize the lag values again; I can't spot the actual code that would do
that right now though.
Nevertheless, both versions fixed the degradation consistently.

> > Both options would make sure that the vhost gives up the CPU as it cannot
> > continue anyway without the kworker handling the event. Option 1 will give
> > up the CPU regardless of whether something was found in the queues, whereas
> > option 2 would only give up the CPU if there is.
> 
> Incorrect, neither schedule() nor marking things with TIF_NEED_RESCHED
> (which has more issues) will make t2 run. In that scenario you have to
> make t1 block, such that t2 is the only possible choice. As long as you
> keep t1 on the runqueue, it will be the most eligible pick at that time.
> 

That makes sense, but it does not match the results I was seeing. I might have
to give this a closer look to figure out why it works in this particular
scenario.

> Now, there is an easy option... but I hate to mention it because I've
> spend a lifetime telling people not to use it (for really good reasons):
> yield().
> With EEVDF yield() will move the virtual deadline ahead by one request.
> That is, given the above scenario:
> 
> t1      |---------<    
> t2        |---------<
> V    -----|-----------------------------
> 
> t1 doing yield(), would result in:
> 
> t1      |-------------------<    
> t2        |---------<
> V    -----|-----------------------------
> 
> And at that point, you'll find that all of a sudden t2 will be picked.
> On the flip side, you might find that when t2 completes another task is
> more likely to run than return to t1 -- because of that elongated
> deadline. Ofc. if t1 and t2 are the only tasks on the CPU this doesn't
> matter.

That would fix the degradation in this particular benchmark scenario,
but I could see this having some unwanted side effects. It would require
a yield_to() which consistently passes control to the target task and then
returns to the caller. That might allow bypassing all considerations
on fairness.

> 
> > It shall be noted, that we encountered similar behavior when running some
> > fio benchmarks. From a brief glance at the code, I was seeing similar
> > intentions: Loop over queues, then trigger an action through some event
> > mechanism. Applying the same patch as mentioned above also fixes this issue.
> > 
> > It could be argued, that this is still something that needs to be somehow
> > addressed by the scheduler since it might affect others as well and there 
> > are in fact patches coming in. Will they address our issue here? Not sure yet.
> 
> > On the other hand, it might just be beneficial to make vhost more resilient
> > towards the scheduler's algorithm by not relying on a certain behavior in
> > the wakeup path.
> 
> So the 'advantage' of EEVDF over CFS is that it has 2 parameters to play
> with: weight and slice. Slice being the new toy in town.
> 
> Specifically in your example you would ideally have task 2 have a
> shorter slice. Except of course its a kworker and you can't very well
> set a kworker with a short slice because you never know wth it will end
> up doing.
> 
> I'm still wondering why exactly it is imperative for t2 to preempt t1.
> Is there some unexpressed serialization / spin-waiting ?
> 

I spent some time crawling through the vhost code, and to me it looks like
vhost is writing data to a socket file which has an eventfd handler attached
that is run by the kworker. Said handler does the actual memory interaction.
So without the kworker running, the transaction is not completed. The
counterpart keeps waiting for that message while the local vhost expects some
reply from the counterpart. So we're in a deadlock until the kworker gets to
run.

This could technically be true for more components which use such a loop+event
construction. We saw the same, although less severe, issue with fio.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
  2024-05-01 15:31                                                                               ` Michael S. Tsirkin
  2024-05-02  9:16                                                                                 ` Peter Zijlstra
@ 2024-05-02 12:23                                                                                 ` Tobias Huschle
  1 sibling, 0 replies; 58+ messages in thread
From: Tobias Huschle @ 2024-05-02 12:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Zijlstra, Luis Machado, Jason Wang, Abel Wu, Linux Kernel,
	kvm, virtualization, netdev, nd, borntraeger, Ingo Molnar,
	Mike Galbraith

On Wed, May 01, 2024 at 11:31:02AM -0400, Michael S. Tsirkin wrote:
> On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
<...>
> > 
> > I'm still wondering why exactly it is imperative for t2 to preempt t1.
> > Is there some unexpressed serialization / spin-waiting ?
> 
> 
> I am not sure but I think the point is that t2 is a kworker. It is
> much cheaper to run it right now when we are already in the kernel
> than return to userspace, let it run for a bit then interrupt it
> and then run t2.
> Right, Tobias?
> 

That would be correct. The optimal scenario would be that t1, the vhost,
does its thing, wakes up t2, the kworker, makes sure t2 executes immediately,
then gets control back and continues its loop without ever
leaving kernel space.

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2024-05-02 12:23 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-16 18:58 EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement) Tobias Huschle
2023-11-17  9:23 ` Peter Zijlstra
2023-11-17  9:58   ` Peter Zijlstra
2023-11-17 12:24   ` Tobias Huschle
2023-11-17 12:37     ` Peter Zijlstra
2023-11-17 13:07       ` Abel Wu
2023-11-21 13:17         ` Tobias Huschle
2023-11-22 10:00           ` Peter Zijlstra
2023-11-27 13:56             ` Tobias Huschle
     [not found]             ` <6564a012.c80a0220.adb78.f0e4SMTPIN_ADDED_BROKEN@mx.google.com>
2023-11-28  8:55               ` Abel Wu
2023-11-29  6:31                 ` Tobias Huschle
2023-12-07  6:22                 ` Tobias Huschle
     [not found]                 ` <07513.123120701265800278@us-mta-474.us.mimecast.lan>
2023-12-07  6:48                   ` Michael S. Tsirkin
2023-12-08  9:24                     ` Tobias Huschle
2023-12-08 17:28                       ` Mike Christie
     [not found]                     ` <56082.123120804242300177@us-mta-137.us.mimecast.lan>
2023-12-08 10:31                       ` Re: " Michael S. Tsirkin
2023-12-08 11:41                         ` Tobias Huschle
     [not found]                         ` <53044.123120806415900549@us-mta-342.us.mimecast.lan>
2023-12-09 10:42                           ` Michael S. Tsirkin
2023-12-11  7:26                             ` Jason Wang
2023-12-11 16:53                               ` Michael S. Tsirkin
2023-12-12  3:00                                 ` Jason Wang
2023-12-12 16:15                                   ` Michael S. Tsirkin
2023-12-13 10:37                                     ` Tobias Huschle
     [not found]                                     ` <42870.123121305373200110@us-mta-641.us.mimecast.lan>
2023-12-13 12:00                                       ` Michael S. Tsirkin
2023-12-13 12:45                                         ` Tobias Huschle
     [not found]                                         ` <25485.123121307454100283@us-mta-18.us.mimecast.lan>
2023-12-13 14:47                                           ` Michael S. Tsirkin
2023-12-13 14:55                                           ` Michael S. Tsirkin
2023-12-14  7:14                                             ` Michael S. Tsirkin
2024-01-08 13:13                                               ` Tobias Huschle
     [not found]                                               ` <92916.124010808133201076@us-mta-622.us.mimecast.lan>
2024-01-09 23:07                                                 ` Michael S. Tsirkin
2024-01-21 18:44                                                 ` Michael S. Tsirkin
2024-01-22 11:29                                                   ` Tobias Huschle
2024-02-01  7:38                                                   ` Tobias Huschle
     [not found]                                                   ` <07974.124020102385100135@us-mta-501.us.mimecast.lan>
2024-02-01  8:08                                                     ` Michael S. Tsirkin
2024-02-01 11:47                                                       ` Tobias Huschle
     [not found]                                                       ` <89460.124020106474400877@us-mta-475.us.mimecast.lan>
2024-02-01 12:08                                                         ` Michael S. Tsirkin
2024-02-22 19:23                                                         ` Michael S. Tsirkin
2024-03-11 17:05                                                         ` Michael S. Tsirkin
2024-03-12  9:45                                                           ` Luis Machado
2024-03-14 11:46                                                             ` Tobias Huschle
     [not found]                                                             ` <73123.124031407552500165@us-mta-156.us.mimecast.lan>
2024-03-14 15:09                                                               ` Michael S. Tsirkin
2024-03-15  8:33                                                                 ` Tobias Huschle
     [not found]                                                                 ` <84704.124031504335801509@us-mta-515.us.mimecast.lan>
2024-03-15 10:31                                                                   ` Michael S. Tsirkin
2024-03-19  8:21                                                                     ` Tobias Huschle
2024-03-19  8:29                                                                       ` Michael S. Tsirkin
2024-03-19  8:59                                                                         ` Tobias Huschle
2024-04-30 10:50                                                                           ` Tobias Huschle
2024-05-01 10:51                                                                             ` Peter Zijlstra
2024-05-01 15:31                                                                               ` Michael S. Tsirkin
2024-05-02  9:16                                                                                 ` Peter Zijlstra
2024-05-02 12:23                                                                                 ` Tobias Huschle
2024-05-02 12:20                                                                               ` Tobias Huschle
2023-11-18  5:14   ` Abel Wu
2023-11-20 10:56     ` Peter Zijlstra
2023-11-20 12:06       ` Abel Wu
2023-11-18  7:33 ` Abel Wu
2023-11-18 15:29   ` Honglei Wang
2023-11-19 13:29 ` Bagas Sanjaya

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).