* [RFC] Scheduler: DMA Engine regression because of sched/fair changes
@ 2022-01-12 15:26 Alexander Fomichev
  2022-01-12 17:05 ` Mel Gorman
  2022-01-16  9:55 ` Thorsten Leemhuis
  0 siblings, 2 replies; 11+ messages in thread
From: Alexander Fomichev @ 2022-01-12 15:26 UTC (permalink / raw)
  To: linux-kernel, dmaengine; +Cc: Mel Gorman, linux

CC: Mel Gorman <mgorman@suse.de>
CC: linux@yadro.com

Hi all,

There's a huge regression found, which affects Intel Xeon's DMA Engine
performance between v4.14 LTS and modern kernels. In certain
circumstances the speed in dmatest is more than 6 times lower.

	- Hardware -
I did testing on 2 systems:
1) Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (Supermicro X11DAi-N)
2) Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz (YADRO Vegman S220)

	- Measurement -
The dmatest result speed decreases with almost any test settings.
Although the most significant impact is revealed with 64K transfers. The
following parameters were used:

modprobe dmatest iterations=1000 timeout=2000 test_buf_size=0x100000 transfer_size=0x10000 norandom=1
echo "dma0chan0" > /sys/module/dmatest/parameters/channel
echo 1 > /sys/module/dmatest/parameters/run

Every test case was performed at least 3 times. All detailed results are
below.

	- Analysis -
Bisecting revealed 2 different bad commits for those 2 systems, but both
change the same function/condition in the same file.
For the system (1) the bad commit is:
[7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
For the system (2) the bad commit is:
[806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle

	- Additional check -
Attempting to revert the changes above, a dirty patch for the (current)
kernel v5.16.0-rc5 was tested too:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f16dfb74246..0a58cc00b1b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5931,8 +5931,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
         * a cpufreq perspective, it's better to have higher utilisation
         * on one CPU.
         */
-       if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-               return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+       if (available_idle_cpu(this_cpu))
+               return this_cpu;

        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

Please take a look to see if this makes sense. With this patch applied,
the DMA Engine performance is restored.

	- Dmatest results TL;DR -

System (1) before bad commit:
---------------------
[  519.894642] dmatest: Added 1 threads using dma0chan0
[  525.383021] dmatest: Started 1 threads using dma0chan0
[  528.521915] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 98367.10 iops 6295494 KB/s (0)
[  544.851751] dmatest: Added 1 threads using dma0chan0
[  546.460064] dmatest: Started 1 threads using dma0chan0
[  549.609504] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 100310.96 iops 6419901 KB/s (0)
[  562.178365] dmatest: Added 1 threads using dma0chan0
[  563.852534] dmatest: Started 1 threads using dma0chan0
[  567.004898] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 98580.44 iops 6309148 KB/s (0)
---------------------

System (1) on HEAD=bad commit:
---------------------
[  149.555401] dmatest: Added 1 threads using dma0chan0
[  154.162444] dmatest: Started 1 threads using dma0chan0
[  157.490868] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 26653.87 iops 1705847 KB/s (0)
[  176.783450] dmatest: Added 1 threads using dma0chan0
[  178.428518] dmatest: Started 1 threads using dma0chan0
[  181.606531] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 14194.86 iops 908471 KB/s (0)
[  192.125218] dmatest: Added 1 threads using dma0chan0
[  194.060029] dmatest: Started 1 threads using dma0chan0
[  197.235265] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 14757.09 iops 944454 KB/s (0)
---------------------

System (1) on v5.16.0-rc5:
---------------------
[ 1430.860170] dmatest: Added 1 threads using dma0chan0
[ 1437.367447] dmatest: Started 1 threads using dma0chan0
[ 1442.756660] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 24837.31 iops 1589588 KB/s (0)
[ 1561.614191] dmatest: Added 1 threads using dma0chan0
[ 1562.816375] dmatest: Started 1 threads using dma0chan0
[ 1566.619614] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 13666.05 iops 874627 KB/s (0)
[ 1585.019601] dmatest: Added 1 threads using dma0chan0
[ 1587.585741] dmatest: Started 1 threads using dma0chan0
[ 1591.386816] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 13521.91 iops 865402 KB/s (0)
---------------------

System (1) on v5.16.0-rc5 with dirty patch:
---------------------
[  733.571508] dmatest: Added 1 threads using dma0chan0
[  746.050800] dmatest: Started 1 threads using dma0chan0
[  749.765600] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 87260.03 iops 5584642 KB/s (0)
[  915.051955] dmatest: Added 1 threads using dma0chan0
[  916.550732] dmatest: Started 1 threads using dma0chan0
[  920.267525] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 88464.25 iops 5661712 KB/s (0)
[  936.781273] dmatest: Added 1 threads using dma0chan0
[  939.528616] dmatest: Started 1 threads using dma0chan0
[  943.247694] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 88833.61 iops 5685351 KB/s (0)
---------------------

System (2) before bad commit:
---------------------
[  481.309411] dmatest: Added 1 threads using dma0chan0
[  491.197425] dmatest: Started 1 threads using dma0chan0
[  497.047315] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 78988.94 iops 5055292 KB/s (0)
[  506.057101] dmatest: Added 1 threads using dma0chan0
[  508.939426] dmatest: Started 1 threads using dma0chan0
[  514.788823] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 77754.44 iops 4976284 KB/s (0)
[  531.894587] dmatest: Added 1 threads using dma0chan0
[  534.053360] dmatest: Started 1 threads using dma0chan0
[  539.906424] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 76988.21 iops 4927246 KB/s (0)
---------------------

System (2) on HEAD=bad commit:
---------------------
[44522.892995] dmatest: Added 1 threads using dma0chan0
[44526.193331] dmatest: Started 1 threads using dma0chan0
[44532.043932] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 80360.01 iops 5143040 KB/s (0)
[44561.121118] dmatest: Added 1 threads using dma0chan0
[44562.868428] dmatest: Started 1 threads using dma0chan0
[44568.808577] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 16080.53 iops 1029154 KB/s (0)
[44728.597409] dmatest: Added 1 threads using dma0chan0
[44730.301566] dmatest: Started 1 threads using dma0chan0
[44736.259009] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 16091.91 iops 1029882 KB/s (0)
---------------------

Thanks for reading.

-- 
Regards,
  Alexander Fomichev


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
  2022-01-12 15:26 [RFC] Scheduler: DMA Engine regression because of sched/fair changes Alexander Fomichev
@ 2022-01-12 17:05 ` Mel Gorman
  2022-01-17  8:19   ` Alexander Fomichev
  2022-01-16  9:55 ` Thorsten Leemhuis
  1 sibling, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2022-01-12 17:05 UTC (permalink / raw)
  To: Alexander Fomichev; +Cc: linux-kernel, dmaengine, linux, Peter Zijlstra

On Wed, Jan 12, 2022 at 06:26:09PM +0300, Alexander Fomichev wrote:
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux@yadro.com
> 
> Hi all,
> 
> There's a huge regression found, which affects Intel Xeon's DMA Engine
> performance between v4.14 LTS and modern kernels. In certain
> circumstances the speed in dmatest is more than 6 times lower.
> 
> 	- Hardware -
> I did testing on 2 systems:
> 1) Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (Supermicro X11DAi-N)
> 2) Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz (YADRO Vegman S220)
> 
> 	- Measurement -
> The dmatest result speed decreases with almost any test settings.
> Although the most significant impact is revealed with 64K transfers. The
> following parameters were used:
> 
> modprobe dmatest iterations=1000 timeout=2000 test_buf_size=0x100000 transfer_size=0x10000 norandom=1
> echo "dma0chan0" > /sys/module/dmatest/parameters/channel
> echo 1 > /sys/module/dmatest/parameters/run
> 
> Every test case was performed at least 3 times. All detailed results are
> below.
> 
> 	- Analysis -
> Bisecting revealed 2 different bad commits for those 2 systems, but both
> change the same function/condition in the same file.
> For the system (1) the bad commit is:
> [7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
> For the system (2) the bad commit is:
> [806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle
> 
> 	- Additional check -
> Attempting to revert the changes above, a dirty patch for the (current)
> kernel v5.16.0-rc5 was tested too:
> 

The consequence of the patch is allowing interrupts to migrate tasks away
from potentially cache-hot data -- L1 misses if the two CPUs share LLC,
or incurring remote memory accesses if migrating cross-node. The secondary
concern is that excessive migration from interrupts that round-robin CPUs
will mean that the CPU does not increase frequency. Minimally, the RFC
patch introduces regressions of its own. The comments cover the two
scenarios of interest:

+        * If this_cpu is idle, it implies the wakeup is from interrupt
+        * context. Only allow the move if cache is shared. Otherwise an
+        * interrupt intensive workload could force all tasks onto one
+        * node depending on the IO topology or IRQ affinity settings.

(This one causes remote memory accesses and potentially overutilisation
of a subset of nodes)

+        * If the prev_cpu is idle and cache affine then avoid a migration.
+        * There is no guarantee that the cache hot data from an interrupt
+        * is more important than cache hot data on the prev_cpu and from
+        * a cpufreq perspective, it's better to have higher utilisation
+        * on one CPU.

(This one incurs L1/L2 misses due to a migration even though LLC may be
shared)

The tests don't say, but what CPUs do the dmatest interrupts get
delivered to? dmatest appears to be an exception in that the *only* hot
data of concern is also related to the interrupt, as the DMA operation is
validated.

However, given that the point of a DMA engine is to transfer data without
the host CPU being involved and the interrupt is delivered on completion,
how realistic is it that the DMA data is immediately accessed on completion
by normal workloads that happen to use the DMA engine? What impact does
it have on the test if noverify or polling is used?

-- 
Mel Gorman
SUSE Labs


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
  2022-01-12 15:26 [RFC] Scheduler: DMA Engine regression because of sched/fair changes Alexander Fomichev
  2022-01-12 17:05 ` Mel Gorman
@ 2022-01-16  9:55 ` Thorsten Leemhuis
  1 sibling, 0 replies; 11+ messages in thread
From: Thorsten Leemhuis @ 2022-01-16  9:55 UTC (permalink / raw)
  To: Alexander Fomichev, linux-kernel, dmaengine
  Cc: Mel Gorman, linux, regressions


[TLDR: I'm adding this regression to regzbot, the Linux kernel
regression tracking bot; most text you find below is compiled from a few
template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.

Thanks for the report.

Adding the regression mailing list to the list of recipients, as it
should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

On 12.01.22 16:26, Alexander Fomichev wrote:
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux@yadro.com
> 
> Hi all,
> 
> There's a huge regression found, which affects Intel Xeon's DMA Engine
> performance between v4.14 LTS and modern kernels. In certain
> circumstances the speed in dmatest is more than 6 times lower.
> 
> 	- Hardware -
> I did testing on 2 systems:
> 1) Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (Supermicro X11DAi-N)
> 2) Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz (YADRO Vegman S220)
> 
> 	- Measurement -
> The dmatest result speed decreases with almost any test settings.
> Although the most significant impact is revealed with 64K transfers. The
> following parameters were used:
> 
> modprobe dmatest iterations=1000 timeout=2000 test_buf_size=0x100000 transfer_size=0x10000 norandom=1
> echo "dma0chan0" > /sys/module/dmatest/parameters/channel
> echo 1 > /sys/module/dmatest/parameters/run
> 
> Every test case was performed at least 3 times. All detailed results are
> below.
> 
> 	- Analysis -
> Bisecting revealed 2 different bad commits for those 2 systems, but both
> change the same function/condition in the same file.
> For the system (1) the bad commit is:
> [7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
> For the system (2) the bad commit is:
> [806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle

Uhh, regzbot is not prepared for this, hence I'll simply pick
7332dec055f2457c386032f7e9b2991eb05c2a0a

#regzbot ^introduced 7332dec055f2457c386032f7e9b2991eb05c2a0a
#regzbot title sched: DMA Engine regression because of sched/fair changes
#regzbot ignore-activity

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail) using the kernel.org redirector,
as explained in 'Documentation/process/submitting-patches.rst'. Regzbot
then will automatically mark the regression as resolved once the fix
lands in the appropriate tree. For more details about regzbot see footer.
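
For illustration, such a tag in the fix's commit message would look roughly
like this (the message-id below is a placeholder, not the actual one for
this report):

    Link: https://lore.kernel.org/r/<message-id-of-the-report>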

Sending this to everyone that got the initial report, to make all aware
of the tracking. I also hope that messages like this motivate people to
directly get at least the regression mailing list and ideally even
regzbot involved when dealing with regressions, as messages like this
wouldn't be needed then.

Don't worry, I'll send further messages wrt to this regression just to
the lists (with a tag in the subject so people can filter them away), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CC on
all further activities wrt to this regression.

> 	- Additional check -
> Attempting to revert the changes above, a dirty patch for the (current)
> kernel v5.16.0-rc5 was tested too:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f16dfb74246..0a58cc00b1b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5931,8 +5931,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>          * a cpufreq perspective, it's better to have higher utilisation
>          * on one CPU.
>          */
> -       if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -               return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> +       if (available_idle_cpu(this_cpu))
> +               return this_cpu;
> 
>         if (sync && cpu_rq(this_cpu)->nr_running == 1)
>                 return this_cpu;
> 
> Please take a look to see if this makes sense. With this patch applied,
> the DMA Engine performance is restored.
> 
> 	- Dmatest results TL;DR -
> 
> System (1) before bad commit:
> ---------------------
> [  519.894642] dmatest: Added 1 threads using dma0chan0
> [  525.383021] dmatest: Started 1 threads using dma0chan0
> [  528.521915] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 98367.10 iops 6295494 KB/s (0)
> [  544.851751] dmatest: Added 1 threads using dma0chan0
> [  546.460064] dmatest: Started 1 threads using dma0chan0
> [  549.609504] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 100310.96 iops 6419901 KB/s (0)
> [  562.178365] dmatest: Added 1 threads using dma0chan0
> [  563.852534] dmatest: Started 1 threads using dma0chan0
> [  567.004898] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 98580.44 iops 6309148 KB/s (0)
> ---------------------
> 
> System (1) on HEAD=bad commit:
> ---------------------
> [  149.555401] dmatest: Added 1 threads using dma0chan0
> [  154.162444] dmatest: Started 1 threads using dma0chan0
> [  157.490868] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 26653.87 iops 1705847 KB/s (0)
> [  176.783450] dmatest: Added 1 threads using dma0chan0
> [  178.428518] dmatest: Started 1 threads using dma0chan0
> [  181.606531] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 14194.86 iops 908471 KB/s (0)
> [  192.125218] dmatest: Added 1 threads using dma0chan0
> [  194.060029] dmatest: Started 1 threads using dma0chan0
> [  197.235265] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 14757.09 iops 944454 KB/s (0)
> ---------------------
> 
> System (1) on v5.16.0-rc5:
> ---------------------
> [ 1430.860170] dmatest: Added 1 threads using dma0chan0
> [ 1437.367447] dmatest: Started 1 threads using dma0chan0
> [ 1442.756660] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 24837.31 iops 1589588 KB/s (0)
> [ 1561.614191] dmatest: Added 1 threads using dma0chan0
> [ 1562.816375] dmatest: Started 1 threads using dma0chan0
> [ 1566.619614] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 13666.05 iops 874627 KB/s (0)
> [ 1585.019601] dmatest: Added 1 threads using dma0chan0
> [ 1587.585741] dmatest: Started 1 threads using dma0chan0
> [ 1591.386816] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 13521.91 iops 865402 KB/s (0)
> ---------------------
> 
> System (1) on v5.16.0-rc5 with dirty patch:
> ---------------------
> [  733.571508] dmatest: Added 1 threads using dma0chan0
> [  746.050800] dmatest: Started 1 threads using dma0chan0
> [  749.765600] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 87260.03 iops 5584642 KB/s (0)
> [  915.051955] dmatest: Added 1 threads using dma0chan0
> [  916.550732] dmatest: Started 1 threads using dma0chan0
> [  920.267525] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 88464.25 iops 5661712 KB/s (0)
> [  936.781273] dmatest: Added 1 threads using dma0chan0
> [  939.528616] dmatest: Started 1 threads using dma0chan0
> [  943.247694] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 88833.61 iops 5685351 KB/s (0)
> ---------------------
> 
> System (2) before bad commit:
> ---------------------
> [  481.309411] dmatest: Added 1 threads using dma0chan0
> [  491.197425] dmatest: Started 1 threads using dma0chan0
> [  497.047315] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 78988.94 iops 5055292 KB/s (0)
> [  506.057101] dmatest: Added 1 threads using dma0chan0
> [  508.939426] dmatest: Started 1 threads using dma0chan0
> [  514.788823] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 77754.44 iops 4976284 KB/s (0)
> [  531.894587] dmatest: Added 1 threads using dma0chan0
> [  534.053360] dmatest: Started 1 threads using dma0chan0
> [  539.906424] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 76988.21 iops 4927246 KB/s (0)
> ---------------------
> 
> System (2) on HEAD=bad commit:
> ---------------------
> [44522.892995] dmatest: Added 1 threads using dma0chan0
> [44526.193331] dmatest: Started 1 threads using dma0chan0
> [44532.043932] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 80360.01 iops 5143040 KB/s (0)
> [44561.121118] dmatest: Added 1 threads using dma0chan0
> [44562.868428] dmatest: Started 1 threads using dma0chan0
> [44568.808577] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 16080.53 iops 1029154 KB/s (0)
> [44728.597409] dmatest: Added 1 threads using dma0chan0
> [44730.301566] dmatest: Started 1 threads using dma0chan0
> [44736.259009] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 16091.91 iops 1029882 KB/s (0)
> ---------------------
> 
> Thanks for reading.
> 

---
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting started guide, and/or the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
tell #regzbot about it in the report, as that will ensure the regression
gets on the radar of regzbot and the regression tracker. That's in your
interest, as they will make sure the report won't fall through the
cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag to the report in the commit message, as explained
in Documentation/process/submitting-patches.rst
That aspect was recently made more explicit in commit 1f57bd42b77c:
https://git.kernel.org/linus/1f57bd42b77c


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
  2022-01-12 17:05 ` Mel Gorman
@ 2022-01-17  8:19   ` Alexander Fomichev
  2022-01-17 10:27     ` Mel Gorman
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Fomichev @ 2022-01-17  8:19 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, dmaengine, linux, Peter Zijlstra

On Wed, Jan 12, 2022 at 05:05:12PM +0000, Mel Gorman wrote:
> On Wed, Jan 12, 2022 at 06:26:09PM +0300, Alexander Fomichev wrote:
> > CC: Mel Gorman <mgorman@suse.de>
> > CC: linux@yadro.com
> > 
> > Hi all,
> > 
> > There's a huge regression found, which affects Intel Xeon's DMA Engine
> > performance between v4.14 LTS and modern kernels. In certain
> > circumstances the speed in dmatest is more than 6 times lower.
> > 
> > 	- Hardware -
> > I did testing on 2 systems:
> > 1) Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (Supermicro X11DAi-N)
> > 2) Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz (YADRO Vegman S220)
> > 
> > 	- Measurement -
> > The dmatest result speed decreases with almost any test settings.
> > Although the most significant impact is revealed with 64K transfers. The
> > following parameters were used:
> > 
> > modprobe dmatest iterations=1000 timeout=2000 test_buf_size=0x100000 transfer_size=0x10000 norandom=1
> > echo "dma0chan0" > /sys/module/dmatest/parameters/channel
> > echo 1 > /sys/module/dmatest/parameters/run
> > 
> > Every test case was performed at least 3 times. All detailed results are
> > below.
> > 
> > 	- Analysis -
> > Bisecting revealed 2 different bad commits for those 2 systems, but both
> > change the same function/condition in the same file.
> > For the system (1) the bad commit is:
> > [7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
> > For the system (2) the bad commit is:
> > [806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle
> > 
> > 	- Additional check -
> > Attempting to revert the changes above, a dirty patch for the (current)
> > kernel v5.16.0-rc5 was tested too:
> > 
> 
> The consequence of the patch is allowing interrupts to migrate tasks away
> from potentially cache-hot data -- L1 misses if the two CPUs share LLC,
> or incurring remote memory accesses if migrating cross-node. The secondary
> concern is that excessive migration from interrupts that round-robin CPUs
> will mean that the CPU does not increase frequency. Minimally, the RFC
> patch introduces regressions of its own. The comments cover the two
> scenarios of interest:
> 
> +        * If this_cpu is idle, it implies the wakeup is from interrupt
> +        * context. Only allow the move if cache is shared. Otherwise an
> +        * interrupt intensive workload could force all tasks onto one
> +        * node depending on the IO topology or IRQ affinity settings.
> 
> (This one causes remote memory accesses and potentially overutilisation
> of a subset of nodes)
> 
> +        * If the prev_cpu is idle and cache affine then avoid a migration.
> +        * There is no guarantee that the cache hot data from an interrupt
> +        * is more important than cache hot data on the prev_cpu and from
> +        * a cpufreq perspective, it's better to have higher utilisation
> +        * on one CPU.
> 
> (This one incurs L1/L2 misses due to a migration even though LLC may be
> shared)
> 
> The tests don't say, but what CPUs do the dmatest interrupts get
> delivered to? dmatest appears to be an exception in that the *only* hot
> data of concern is also related to the interrupt, as the DMA operation is
> validated.
> 
> However, given that the point of a DMA engine is to transfer data without
> the host CPU being involved and the interrupt is delivered on completion,
> how realistic is it that the DMA data is immediately accessed on completion
> by normal workloads that happen to use the DMA engine? What impact does
> it have on the test if noverify or polling is used?

Thanks for the comment. Some additional notes regarding the issue.

1) You're right. When the options "noverify=1" and "polled=1" are used,
no performance reduction occurs.

2) DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA, is
used particularly for off-CPU data transfer via device's NTB to a remote
host. In the NTRDMA project, which I'm involved in, DMA Engine sends data to
remote ring buffer and on data arrival CPU processes local ring buffers.

3) I checked dmatest with noverify=0 on the PTDMA driver: AMD EPYC 7313 16-Core
Processor/ASRock ROMED8-2T. The regression occurs on this hardware too.

4) Do you mean that with noverify=N and dirty patch, data verification
is performed on cached data and thus measured performance is fake?

5) What should DMA Engine enabled drivers (and dmatest) use as a design
pattern to conform to the migration/cache behavior? Does the scheduler
optimisation conflict with DMA Engine performance in general?

6) I didn't suggest the RFC patch for real-world usage. It was just a test
case to find out the cause of the low speed.

Comments/answers/suggestions are welcome.

-- 
Regards,
  Alexander


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
  2022-01-17  8:19   ` Alexander Fomichev
@ 2022-01-17 10:27     ` Mel Gorman
  2022-01-17 17:44       ` Alexander Fomichev
       [not found]       ` <20220118020448.2399-1-hdanton@sina.com>
  0 siblings, 2 replies; 11+ messages in thread
From: Mel Gorman @ 2022-01-17 10:27 UTC (permalink / raw)
  To: Alexander Fomichev; +Cc: linux-kernel, dmaengine, linux, Peter Zijlstra

On Mon, Jan 17, 2022 at 11:19:05AM +0300, Alexander Fomichev wrote:
> On Wed, Jan 12, 2022 at 05:05:12PM +0000, Mel Gorman wrote:
> > On Wed, Jan 12, 2022 at 06:26:09PM +0300, Alexander Fomichev wrote:
> > > CC: Mel Gorman <mgorman@suse.de>
> > > CC: linux@yadro.com
> > > 
> > > Hi all,
> > > 
> > > There's a huge regression found, which affects Intel Xeon's DMA Engine
> > > performance between v4.14 LTS and modern kernels. In certain
> > > circumstances the speed in dmatest is more than 6 times lower.
> > > 
> > > 	- Hardware -
> > > I did testing on 2 systems:
> > > 1) Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (Supermicro X11DAi-N)
> > > 2) Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz (YADRO Vegman S220)
> > > 
> > > 	- Measurement -
> > > The dmatest result speed decreases with almost any test settings.
> > > Although the most significant impact is revealed with 64K transfers. The
> > > following parameters were used:
> > > 
> > > modprobe dmatest iterations=1000 timeout=2000 test_buf_size=0x100000 transfer_size=0x10000 norandom=1
> > > echo "dma0chan0" > /sys/module/dmatest/parameters/channel
> > > echo 1 > /sys/module/dmatest/parameters/run
> > > 
> > > Every test case was performed at least 3 times. All detailed results are
> > > below.
> > > 
> > > 	- Analysis -
> > > Bisecting revealed 2 different bad commits for those 2 systems, but both
> > > change the same function/condition in the same file.
> > > For the system (1) the bad commit is:
> > > [7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
> > > For the system (2) the bad commit is:
> > > [806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle
> > > 
> > > 	- Additional check -
> > > Attempting to revert the changes above, a dirty patch for the (current)
> > > kernel v5.16.0-rc5 was tested too:
> > > 
> > 
> > The consequence of the patch is allowing interrupts to migrate tasks away
> > from potentially cache-hot data -- L1 misses if the two CPUs share LLC,
> > or incurring remote memory accesses if migrating cross-node. The secondary
> > concern is that excessive migration from interrupts that round-robin CPUs
> > will mean that the CPU does not increase frequency. Minimally, the RFC
> > patch introduces regressions of its own. The comments cover the two
> > scenarios of interest:
> > 
> > +        * If this_cpu is idle, it implies the wakeup is from interrupt
> > +        * context. Only allow the move if cache is shared. Otherwise an
> > +        * interrupt intensive workload could force all tasks onto one
> > +        * node depending on the IO topology or IRQ affinity settings.
> > 
> > (This one causes remote memory accesses and potentially overutilisation
> > of a subset of nodes)
> > 
> > +        * If the prev_cpu is idle and cache affine then avoid a migration.
> > +        * There is no guarantee that the cache hot data from an interrupt
> > +        * is more important than cache hot data on the prev_cpu and from
> > +        * a cpufreq perspective, it's better to have higher utilisation
> > +        * on one CPU.
> > 
> > (This one incurs L1/L2 misses due to a migration even though LLC may be
> > shared)
> > 
> > The tests don't say, but what CPUs do the dmatest interrupts get
> > delivered to? dmatest appears to be an exception in that the *only* hot
> > data of concern is also related to the interrupt, as the DMA operation is
> > validated.
> > 
> > However, given that the point of a DMA engine is to transfer data without
> > the host CPU being involved and the interrupt is delivered on completion,
> > how realistic is it that the DMA data is immediately accessed on completion
> > by normal workloads that happen to use the DMA engine? What impact does
> > it have on the test if noverify or polling is used?
> 
> Thanks for the comment. Some additional notes regarding the issue.
> 
> 1) You're right. When the options "noverify=1" and "polled=1" are used,
> no performance reduction occurs.
> 

How about just noverify=1 on its own? It's a stronger indicator that
cache hotness is a factor.

> 2) DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA, is
> used particularly for off-CPU data transfer via device's NTB to a remote
> host. In the NTRDMA project, which I'm involved in, DMA Engine sends data to
> remote ring buffer and on data arrival CPU processes local ring buffers.
> 

Is there any impact of the patch in this case? Given that it's a remote
host, the data is likely cache cold anyway.

> 3) I checked dmatest with noverify=0 on the PTDMA driver: AMD EPYC 7313 16-Core
> Processor/ASRock ROMED8-2T. The regression occurs on this hardware too.
> 

I expect it would be the same reason, the data is cache cold for the
CPU.

> 4) Do you mean that with noverify=N and dirty patch, data verification
> is performed on cached data and thus measured performance is fake?
> 

I think it's the data verification going slower because the tasks are
not aggressively migrating on interrupt. The flip side is other
interrupts such as IO completion should not migrate the tasks given that
the interrupt is not necessarily correlated with data hotness.

> 5) What should DMA Engine enabled drivers (and dmatest) use as a design
> pattern to conform to the migration/cache behavior? Does the scheduler
> optimisation conflict with DMA Engine performance in general?
> 

I'm not familiar with DMA engine drivers but if they use wake_up
interfaces then passing WF_SYNC or calling the wake_up_*_sync helpers
may force the migration.
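
(Purely as an illustration of that suggestion, not a tested driver change:
a dmaengine completion callback could wake its waiter with a _sync variant,
which passes WF_SYNC down the wakeup path and hints that the waker is about
to stop running, so the waiter may be pulled towards the interrupt-handling
CPU. The structure and names below are made up for the sketch.)

#include <linux/wait.h>

struct my_test_ctx {
	wait_queue_head_t	done_wait;	/* set up with init_waitqueue_head() */
	bool			done;
};

/* completion callback, runs from the DMA interrupt/tasklet path */
static void my_dma_done(void *arg)
{
	struct my_test_ctx *ctx = arg;

	ctx->done = true;
	/* _sync wakeup: the scheduler may place the waiter on this CPU */
	wake_up_interruptible_sync(&ctx->done_wait);
}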

> 6) I didn't suggest the RFC patch for real-world usage. It was just a test
> case to find out the cause of the low speed.
> 

I understand, but I'm reluctant to go with the dirty patch on the
grounds that it would reintroduce another class of regressions.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
  2022-01-17 10:27     ` Mel Gorman
@ 2022-01-17 17:44       ` Alexander Fomichev
       [not found]       ` <20220118020448.2399-1-hdanton@sina.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Alexander Fomichev @ 2022-01-17 17:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, dmaengine, linux, Peter Zijlstra

On Mon, Jan 17, 2022 at 10:27:01AM +0000, Mel Gorman wrote:
> > 1) You're right. When the options "noverify=1" and "polled=1" are used,
> > no performance reduction occurs.
> 
> How about just noverify=1 on its own? It's a stronger indicator that
> cache hotness is a factor.
> 

With "noverify=1 polled=0" the performance reduction is only 10-20%,
but still exists.

-----< v5.15.8-vanilla >-----
[17057.866760] dmatest: Added 1 threads using dma0chan0
[17060.133880] dmatest: Started 1 threads using dma0chan0
[17060.154343] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 49338.85 iops 3157686 KB/s (0)
[17063.737887] dmatest: Added 1 threads using dma0chan0
[17065.113838] dmatest: Started 1 threads using dma0chan0
[17065.137659] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42183.41 iops 2699738 KB/s (0)
[17100.339989] dmatest: Added 1 threads using dma0chan0
[17102.190764] dmatest: Started 1 threads using dma0chan0
[17102.214285] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42844.89 iops 2742073 KB/s (0)
-----< end >-----

-----< 5.15.8-ioat-ptdma-dirty-fix+ >-----
[ 6183.356549] dmatest: Added 1 threads using dma0chan0
[ 6187.868237] dmatest: Started 1 threads using dma0chan0
[ 6187.887389] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52753.74 iops 3376239 KB/s (0)
[ 6201.913154] dmatest: Added 1 threads using dma0chan0
[ 6204.701340] dmatest: Started 1 threads using dma0chan0
[ 6204.720490] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52614.96 iops 3367357 KB/s (0)
[ 6285.114603] dmatest: Added 1 threads using dma0chan0
[ 6287.031875] dmatest: Started 1 threads using dma0chan0
[ 6287.050278] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54939.01 iops 3516097 KB/s (0)
-----< end >-----

> > 2) DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA, is
> > used particularly for off-CPU data transfer via device's NTB to a remote
> > host. In the NTRDMA project, which I'm involved in, DMA Engine sends data to
> > remote ring buffer and on data arrival CPU processes local ring buffers.
> > 
> 
> Is there any impact of the patch in this case? Given that it's a remote
> host, the data is likely cache cold anyway.
> 

It's complicated. Currently we have a bunch of problems with the
project, so we decompose them and try to solve them separately. Here
we faced the DMA Engine issue.

> > 4) Do you mean that with noverify=N and dirty patch, data verification
> > is performed on cached data and thus measured performance is fake?
> > 
> 
> I think it's the data verification going slower because the tasks are
> not aggressively migrating on interrupt. The flip side is other
> interrupts such as IO completion should not migrate the tasks given that
> the interrupt is not necessarily correlated with data hotness.
> 

It's quite strange, because dmatest subtracts the verification time from
the overall test time. I suspect the measurement may be inaccurate.
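
(For reference, the rough shape of that accounting -- paraphrased, not
quoted, from drivers/dma/dmatest.c; variable names here are illustrative.
The fill and compare phases are timed separately and subtracted from the
total before iops and KB/s are derived.)

ktime_t total, fill_time = 0, compare_time = 0, start;

total = ktime_get();
/* per iteration:
 *   start = ktime_get();
 *   ... fill source buffers / verify destination buffers ...
 *   fill_time = ktime_add(fill_time, ktime_sub(ktime_get(), start));
 *   (and likewise for compare_time around the verify step)
 */
total = ktime_sub(ktime_get(), total);
total = ktime_sub(total, compare_time);
total = ktime_sub(total, fill_time);
/* iops and KB/s are then computed from this adjusted runtime */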

> > 5) What should DMA Engine enabled drivers (and dmatest) use as a design
> > pattern to conform to the migration/cache behavior? Does the scheduler
> > optimisation conflict with DMA Engine performance in general?
> > 
> 
> I'm not familiar with DMA engine drivers but if they use wake_up
> interfaces then passing WF_SYNC or calling the wake_up_*_sync helpers
> may force the migration.
> 

Thanks for the advice. I'll try to check if this is a solution.


-- 
Regards,
  Alexander


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
       [not found]       ` <20220118020448.2399-1-hdanton@sina.com>
@ 2022-01-18 10:05         ` Mel Gorman
  2022-01-19 12:55         ` Alexander Fomichev
       [not found]         ` <20220121101217.2849-1-hdanton@sina.com>
  2 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2022-01-18 10:05 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Alexander Fomichev, linux-kernel, dmaengine, linux, Peter Zijlstra

On Tue, Jan 18, 2022 at 10:04:48AM +0800, Hillf Danton wrote:
> > > > 5) What should DMA Engine enabled drivers (and dmatest) use as a design
> > > > pattern to conform to the migration/cache behavior? Does the scheduler
> > > > optimisation conflict with DMA Engine performance in general?
> > > > 
> > > 
> > > I'm not familiar with DMA engine drivers but if they use wake_up
> > > interfaces then passing WF_SYNC or calling the wake_up_*_sync helpers
> > > may force the migration.
> > > 
> > 
> > Thanks for the advice. I'll try to check if this is a solution.
> 
> Check if cold cache provides some room for selecting CPU.
> 
> Only for thoughts now.
> 

That will still favour migrating tasks between CPUs that share LLC cache
at the expense of losing some higher level caches and cpufreq state
(depending on the CPUfreq governor).

-- 
Mel Gorman
SUSE Labs


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
       [not found]       ` <20220118020448.2399-1-hdanton@sina.com>
  2022-01-18 10:05         ` Mel Gorman
@ 2022-01-19 12:55         ` Alexander Fomichev
       [not found]         ` <20220121101217.2849-1-hdanton@sina.com>
  2 siblings, 0 replies; 11+ messages in thread
From: Alexander Fomichev @ 2022-01-19 12:55 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, linux-kernel, dmaengine, linux, Peter Zijlstra

On Tue, Jan 18, 2022 at 10:04:48AM +0800, Hillf Danton wrote:
> On Mon, 17 Jan 2022 20:44:19 +0300 Alexander Fomichev wrote:
> > On Mon, Jan 17, 2022 at 10:27:01AM +0000, Mel Gorman wrote:
> > > > 1) You're right. When the options "noverify=1" and "polled=1" are used,
> > > > no performance reduction occurs.
> > > 
> > > How about just noverify=1 on its own? It's a stronger indicator that
> > > cache hotness is a factor.
> > > 
> > 
> > With "noverify=1 polled=0" the performance reduction is only 10-20%,
> > but still exists.
> > 
> > -----< v5.15.8-vanilla >-----
> > [17057.866760] dmatest: Added 1 threads using dma0chan0
> > [17060.133880] dmatest: Started 1 threads using dma0chan0
> > [17060.154343] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 49338.85 iops 3157686 KB/s (0)
> > [17063.737887] dmatest: Added 1 threads using dma0chan0
> > [17065.113838] dmatest: Started 1 threads using dma0chan0
> > [17065.137659] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42183.41 iops 2699738 KB/s (0)
> > [17100.339989] dmatest: Added 1 threads using dma0chan0
> > [17102.190764] dmatest: Started 1 threads using dma0chan0
> > [17102.214285] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42844.89 iops 2742073 KB/s (0)
> > -----< end >-----
> > 
> > -----< 5.15.8-ioat-ptdma-dirty-fix+ >-----
> > [ 6183.356549] dmatest: Added 1 threads using dma0chan0
> > [ 6187.868237] dmatest: Started 1 threads using dma0chan0
> > [ 6187.887389] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52753.74 iops 3376239 KB/s (0)
> > [ 6201.913154] dmatest: Added 1 threads using dma0chan0
> > [ 6204.701340] dmatest: Started 1 threads using dma0chan0
> > [ 6204.720490] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52614.96 iops 3367357 KB/s (0)
> > [ 6285.114603] dmatest: Added 1 threads using dma0chan0
> > [ 6287.031875] dmatest: Started 1 threads using dma0chan0
> > [ 6287.050278] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54939.01 iops 3516097 KB/s (0)
> > -----< end >-----
> > 
> 
> Check if cold cache provides some room for selecting CPU.
> 
> Only for thoughts now.
> 
> Hillf
> 
> +++ x/kernel/sched/fair.c
> @@ -5889,19 +5889,16 @@ static int
>  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  {
>  	/*
> -	 * If this_cpu is idle, it implies the wakeup is from interrupt
> -	 * context. Only allow the move if cache is shared. Otherwise an
> -	 * interrupt intensive workload could force all tasks onto one
> -	 * node depending on the IO topology or IRQ affinity settings.
> -	 *
> -	 * If the prev_cpu is idle and cache affine then avoid a migration.
> -	 * There is no guarantee that the cache hot data from an interrupt
> -	 * is more important than cache hot data on the prev_cpu and from
> -	 * a cpufreq perspective, it's better to have higher utilisation
> -	 * on one CPU.
> +	 * select this cpu if both are idle because of
> +	 * cold shared cache
>  	 */
> -	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> +	if (cpus_share_cache(this_cpu, prev_cpu)) {
> +		if (available_idle_cpu(this_cpu))
> +			return this_cpu;
> +
> +		if (available_idle_cpu(prev_cpu))
> +			return prev_cpu;
> +	}
>  
>  	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>  		return this_cpu;

Hi Hillf,

The results with your patch are controversial:

-----< v5.15.8-Hillf-Danton-patch+ >-----
[ 1572.178884] dmatest: Added 1 threads using dma0chan0
[ 1577.413535] dmatest: Started 1 threads using dma0chan0
[ 1577.432495] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 53188.66 iops 3404074 KB/s (0)
[ 1592.356173] dmatest: Added 1 threads using dma0chan0
[ 1593.791100] dmatest: Started 1 threads using dma0chan0
[ 1593.815282] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 41668.40 iops 2666777 KB/s (0)
[ 1617.117040] dmatest: Added 1 threads using dma0chan0
[ 1619.545890] dmatest: Started 1 threads using dma0chan0
[ 1619.569639] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42426.81 iops 2715316 KB/s (0)
-----< end >-----

Just to remind, used dmatest parameters:

/sys/module/dmatest/parameters/iterations:1000
/sys/module/dmatest/parameters/alignment:-1
/sys/module/dmatest/parameters/verbose:N
/sys/module/dmatest/parameters/norandom:Y
/sys/module/dmatest/parameters/max_channels:0
/sys/module/dmatest/parameters/dmatest:0
/sys/module/dmatest/parameters/polled:N
/sys/module/dmatest/parameters/threads_per_chan:1
/sys/module/dmatest/parameters/noverify:Y
/sys/module/dmatest/parameters/test_buf_size:1048576
/sys/module/dmatest/parameters/transfer_size:65536
/sys/module/dmatest/parameters/run:N
/sys/module/dmatest/parameters/wait:Y
/sys/module/dmatest/parameters/timeout:2000
/sys/module/dmatest/parameters/xor_sources:3
/sys/module/dmatest/parameters/pq_sources:3


-- 
Regards,
  Alexander


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
       [not found]         ` <20220121101217.2849-1-hdanton@sina.com>
@ 2022-01-21 13:46           ` Alexander Fomichev
       [not found]           ` <20220122233314.2999-1-hdanton@sina.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Alexander Fomichev @ 2022-01-21 13:46 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, linux-kernel, dmaengine, linux, Peter Zijlstra

On Fri, Jan 21, 2022 at 06:12:17PM +0800, Hillf Danton wrote:
> On Wed, 19 Jan 2022 15:55:13 +0300 Alexander Fomichev wrote:
> >On Tue, Jan 18, 2022 at 10:04:48AM +0800, Hillf Danton wrote:
> >> On Mon, 17 Jan 2022 20:44:19 +0300 Alexander Fomichev wrote:
> >> > On Mon, Jan 17, 2022 at 10:27:01AM +0000, Mel Gorman wrote:
> >> > 
> >> > -----< v5.15.8-vanilla >-----
> >> > [17057.866760] dmatest: Added 1 threads using dma0chan0
> >> > [17060.133880] dmatest: Started 1 threads using dma0chan0
> >> > [17060.154343] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 49338.85 iops 3157686 KB/s (0)
> >> > [17063.737887] dmatest: Added 1 threads using dma0chan0
> >> > [17065.113838] dmatest: Started 1 threads using dma0chan0
> >> > [17065.137659] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42183.41 iops 2699738 KB/s (0)
> >> > [17100.339989] dmatest: Added 1 threads using dma0chan0
> >> > [17102.190764] dmatest: Started 1 threads using dma0chan0
> >> > [17102.214285] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42844.89 iops 2742073 KB/s (0)
> >> > -----< end >-----
> >> > 
> >
> >Just to remind, used dmatest parameters:
> >
> >/sys/module/dmatest/parameters/iterations:1000
> >/sys/module/dmatest/parameters/alignment:-1
> >/sys/module/dmatest/parameters/verbose:N
> >/sys/module/dmatest/parameters/norandom:Y
> >/sys/module/dmatest/parameters/max_channels:0
> >/sys/module/dmatest/parameters/dmatest:0
> >/sys/module/dmatest/parameters/polled:N
> >/sys/module/dmatest/parameters/threads_per_chan:1
> >/sys/module/dmatest/parameters/noverify:Y
> >/sys/module/dmatest/parameters/test_buf_size:1048576
> >/sys/module/dmatest/parameters/transfer_size:65536
> >/sys/module/dmatest/parameters/run:N
> >/sys/module/dmatest/parameters/wait:Y
> >/sys/module/dmatest/parameters/timeout:2000
> >/sys/module/dmatest/parameters/xor_sources:3
> >/sys/module/dmatest/parameters/pq_sources:3
> 
> 
> See if tuning back down 10 degree can close the gap in iops, in the
> assumption that the prev CPU can be ignored in case of cold cache.
> 
> Also I want to see the diff in the output of "cat /proc/interrupts" before
> and after dmatest, wondering if the dma irq is bound to a single CPU core
> or dancing on several ones.
> 
> Hillf
> 
> +++ x/kernel/sched/fair.c
> @@ -5888,20 +5888,10 @@ static int wake_wide(struct task_struct
>  static int
>  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  {
> -	/*
> -	 * If this_cpu is idle, it implies the wakeup is from interrupt
> -	 * context. Only allow the move if cache is shared. Otherwise an
> -	 * interrupt intensive workload could force all tasks onto one
> -	 * node depending on the IO topology or IRQ affinity settings.
> -	 *
> -	 * If the prev_cpu is idle and cache affine then avoid a migration.
> -	 * There is no guarantee that the cache hot data from an interrupt
> -	 * is more important than cache hot data on the prev_cpu and from
> -	 * a cpufreq perspective, it's better to have higher utilisation
> -	 * on one CPU.
> -	 */
> -	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> +	/* select this cpu because of cold cache */
> +	if (cpus_share_cache(this_cpu, prev_cpu))
> +		if (available_idle_cpu(this_cpu))
> +			return this_cpu;
>  
>  	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>  		return this_cpu;
> --

Hi Hillf,

Thanks for the information.
With the recent patch (I called it patch2) the results are the following:

-----< 5.15.8-Hillf-Danton-patch2+ noverify=Y >-----
[  646.568455] dmatest: Added 1 threads using dma0chan0
[  661.127077] dmatest: Started 1 threads using dma0chan0
[  661.147156] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 50251.25 iops 3216080 KB/s (0)
[  675.132323] dmatest: Added 1 threads using dma0chan0
[  676.205829] dmatest: Started 1 threads using dma0chan0
[  676.225991] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 50022.50 iops 3201440 KB/s (0)
[  703.100813] dmatest: Added 1 threads using dma0chan0
[  704.933579] dmatest: Started 1 threads using dma0chan0
[  704.953733] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 49950.04 iops 3196803 KB/s (0)
-----< end >-----

Also I have re-run the test with 'noverify=N' option, just for
illustration.

-----< 5.15.8-Hillf-Danton-patch2+ noverify=N >-----
[ 1614.739687] dmatest: Added 1 threads using dma0chan0
[ 1620.346536] dmatest: Started 1 threads using dma0chan0
[ 1623.254880] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 23544.92 iops 1506875 KB/s (0)
[ 1634.974200] dmatest: Added 1 threads using dma0chan0
[ 1635.981532] dmatest: Started 1 threads using dma0chan0
[ 1638.892182] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 23703.98 iops 1517055 KB/s (0)
[ 1652.878143] dmatest: Added 1 threads using dma0chan0
[ 1655.235130] dmatest: Started 1 threads using dma0chan0
[ 1658.143206] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 23526.64 iops 1505705 KB/s (0)
-----< end >-----

/proc/interrupts changes before/after the test:

-----< interrupts.diff >-----
- 184:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0       6000          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 103813120-edge      0000:c6:00.2
+ 184:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0       9000          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI 103813120-edge      0000:c6:00.2
-----< end >-----

It looks like the MSI handler is called on the same CPU all the time.

-- 
Regards,
  Alexander


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
       [not found]           ` <20220122233314.2999-1-hdanton@sina.com>
@ 2022-01-28 16:50             ` Alexander Fomichev
  2022-02-23 15:24               ` Thorsten Leemhuis
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Fomichev @ 2022-01-28 16:50 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Mel Gorman, linux-kernel, dmaengine, linux, Peter Zijlstra

On Sun, Jan 23, 2022 at 07:33:14AM +0800, Hillf Danton wrote:
> 
> Let's put the pieces together based on the data collected.
> 
> 1, No irq migration was observed.
> 
> 2, Your patch that produced the highest iops so far
> 
> -----< 5.15.8-ioat-ptdma-dirty-fix+ >-----
> [ 6183.356549] dmatest: Added 1 threads using dma0chan0
> [ 6187.868237] dmatest: Started 1 threads using dma0chan0
> [ 6187.887389] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52753.74 iops 3376239 KB/s (0)
> [ 6201.913154] dmatest: Added 1 threads using dma0chan0
> [ 6204.701340] dmatest: Started 1 threads using dma0chan0
> [ 6204.720490] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52614.96 iops 3367357 KB/s (0)
> [ 6285.114603] dmatest: Added 1 threads using dma0chan0
> [ 6287.031875] dmatest: Started 1 threads using dma0chan0
> [ 6287.050278] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54939.01 iops 3516097 KB/s (0)
> -----< end >-----
> 
> 
> -       if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -               return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> +       if (available_idle_cpu(this_cpu))
> +               return this_cpu;
> 
> prefers this cpu if it is idle regardless of cache affinity.
> 
> This implies task migration to the cpu that handled irq.
> 
> 3, Given this cpu is different from the prev cpu, in addition to no irq
> migration, the tested task was bouncing between the two cpus, with one
> more question arising: why was the task migrated off the irq-handling cpu?
> 
> Despite no evidence, I bet no bounce occurred given iops observed.
> 

IMHO, your assumptions are correct.
I've added CPU number counters at every step of dmatest (a rough sketch of
the instrumentation is at the end of this mail). It reveals that the test
task migrates between (at least 2) CPUs even in one-thread mode. Below is
for the vanilla 5.15.8 kernel:

-----< threads_per_chan=1 >-----
[19449.557950] dmatest: Added 1 threads using dma0chan0
[19469.238180] dmatest: Started 1 threads using dma0chan0
[19469.253962] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 65291.19 iops 4178636 KB/s (0)
[19469.253973] dmatest: CPUs hit: #52:417 #63:583  (times)
-----< end >-----

Note that IRQ handler runs on CPU #52 in this environment.

In 4 thread mode the task migrates even more aggressively:

-----< threads_per_chan=4 >-----
[19355.460227] dmatest: Added 4 threads using dma0chan0
[19359.841182] dmatest: Started 4 threads using dma0chan0
[19359.860447] dmatest: dma0chan0-copy3: summary 1000 tests, 0 failures 53908.35 iops 3450134 KB/s (0)
[19359.860451] dmatest: dma0chan0-copy2: summary 1000 tests, 0 failures 54179.98 iops 3467519 KB/s (0)
[19359.860456] dmatest: CPUs hit: #50:1000  (times)
[19359.860459] dmatest: CPUs hit: #17:1000  (times)
[19359.860459] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54048.21 iops 3459085 KB/s (0)
[19359.860466] dmatest: CPUs hit: #31:1000  (times)
[19359.861420] dmatest: dma0chan0-copy1: summary 1000 tests, 0 failures 51466.80 iops 3293875 KB/s (0)
[19359.861425] dmatest: CPUs hit: #30:213 #48:556 #52:231  (times)
-----< end >-----

On the other hand, with the dirty-patched kernel the task doesn't migrate:

-----< patched threads_per_chan=1 >-----
[ 2100.142002] dmatest: Added 1 threads using dma0chan0
[ 2102.359727] dmatest: Started 1 threads using dma0chan0
[ 2102.373594] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 76173.06 iops 4875076 KB/s (0)
[ 2102.373600] dmatest: CPUs hit: #49:1000  (times)
-----< end >-----

IRQ handler runs on CPU #49 in this case.

However, in 4-thread mode the task still migrates. I think we should
consider such a scenario as not relevant to this issue.
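
(The counters were added roughly along these lines -- a sketch, not the
exact patch; the cpu_hits name and print format are invented here, and the
fragment assumes it sits inside dmatest.c where the needed headers are
already included:)

/* per test thread, e.g. near the top of dmatest_func() */
unsigned int *cpu_hits = kcalloc(nr_cpu_ids, sizeof(*cpu_hits), GFP_KERNEL);
unsigned int cpu;

/* inside the iteration loop, once per transfer */
cpu_hits[raw_smp_processor_id()]++;

/* next to the existing per-thread summary pr_info() */
for (cpu = 0; cpu < nr_cpu_ids; cpu++)
	if (cpu_hits[cpu])
		pr_info("CPU #%u hit %u times\n", cpu, cpu_hits[cpu]);
kfree(cpu_hits);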

-- 
Regards,
  Alexander


* Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
  2022-01-28 16:50             ` Alexander Fomichev
@ 2022-02-23 15:24               ` Thorsten Leemhuis
  0 siblings, 0 replies; 11+ messages in thread
From: Thorsten Leemhuis @ 2022-02-23 15:24 UTC (permalink / raw)
  To: Alexander Fomichev, Hillf Danton
  Cc: Mel Gorman, linux-kernel, dmaengine, linux, Peter Zijlstra, regressions

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

After the message below nothing happened anymore afaics. Could anybody be
so kind as to provide a quick status update, to ensure this wasn't simply
forgotten? And if this is not considered a regression anymore (I didn't
fully understand the message below, sorry), it would be ideal if the person
that replies could include a paragraph stating something like "#regzbot
invalid: Some reason why this can be ignored", as then I'll stop
tracking this. tia!

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

#regzbot poke

On 28.01.22 17:50, Alexander Fomichev wrote:
> On Sun, Jan 23, 2022 at 07:33:14AM +0800, Hillf Danton wrote:
>>
>> Let's put the pieces together based on the data collected.
>>
>> 1, No irq migration was observed.
>>
>> 2, Your patch that produced the highest iops so far
>>
>> -----< 5.15.8-ioat-ptdma-dirty-fix+ >-----
>> [ 6183.356549] dmatest: Added 1 threads using dma0chan0
>> [ 6187.868237] dmatest: Started 1 threads using dma0chan0
>> [ 6187.887389] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52753.74 iops 3376239 KB/s (0)
>> [ 6201.913154] dmatest: Added 1 threads using dma0chan0
>> [ 6204.701340] dmatest: Started 1 threads using dma0chan0
>> [ 6204.720490] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52614.96 iops 3367357 KB/s (0)
>> [ 6285.114603] dmatest: Added 1 threads using dma0chan0
>> [ 6287.031875] dmatest: Started 1 threads using dma0chan0
>> [ 6287.050278] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54939.01 iops 3516097 KB/s (0)
>> -----< end >-----
>>
>>
>> -       if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>> -               return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>> +       if (available_idle_cpu(this_cpu))
>> +               return this_cpu;
>>
>> prefers this cpu if it is idle regardless of cache affinity.
>>
>> This implies task migration to the cpu that handled irq.
>>
>> 3, Given this cpu is different from the prev cpu, in addition to no irq
>> migration, the tested task was bouncing between the two cpus, with one
>> more question arising: why was the task migrated off the irq-handling cpu?
>>
>> Despite no evidence, I bet no bounce occurred given iops observed.
>>
> 
> IMHO, your assumptions are correct.
> I've added CPU number counters at every step of dmatest. It reveals that
> the test task migrates between (at least 2) CPUs even in one-thread mode.
> Below is for the vanilla 5.15.8 kernel:
> 
> -----< threads_per_chan=1 >-----
> [19449.557950] dmatest: Added 1 threads using dma0chan0
> [19469.238180] dmatest: Started 1 threads using dma0chan0
> [19469.253962] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 65291.19 iops 4178636 KB/s (0)
> [19469.253973] dmatest: CPUs hit: #52:417 #63:583  (times)
> -----< end >-----
> 
> Note that IRQ handler runs on CPU #52 in this environment.
> 
> In 4 thread mode the task migrates even more aggressively:
> 
> -----< threads_per_chan=4 >-----
> [19355.460227] dmatest: Added 4 threads using dma0chan0
> [19359.841182] dmatest: Started 4 threads using dma0chan0
> [19359.860447] dmatest: dma0chan0-copy3: summary 1000 tests, 0 failures 53908.35 iops 3450134 KB/s (0)
> [19359.860451] dmatest: dma0chan0-copy2: summary 1000 tests, 0 failures 54179.98 iops 3467519 KB/s (0)
> [19359.860456] dmatest: CPUs hit: #50:1000  (times)
> [19359.860459] dmatest: CPUs hit: #17:1000  (times)
> [19359.860459] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54048.21 iops 3459085 KB/s (0)
> [19359.860466] dmatest: CPUs hit: #31:1000  (times)
> [19359.861420] dmatest: dma0chan0-copy1: summary 1000 tests, 0 failures 51466.80 iops 3293875 KB/s (0)
> [19359.861425] dmatest: CPUs hit: #30:213 #48:556 #52:231  (times)
> -----< end >-----
> 
> On the other hand, with the dirty-patched kernel the task doesn't migrate:
> 
> -----< patched threads_per_chan=1 >-----
> [ 2100.142002] dmatest: Added 1 threads using dma0chan0
> [ 2102.359727] dmatest: Started 1 threads using dma0chan0
> [ 2102.373594] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 76173.06 iops 4875076 KB/s (0)
> [ 2102.373600] dmatest: CPUs hit: #49:1000  (times)
> -----< end >-----
> 
> IRQ handler runs on CPU #49 in this case.
> 
> However, in 4-thread mode the task still migrates. I think we should
> consider such a scenario as not relevant to this issue.
> 



