All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bruno Goncalves <bgoncalv@redhat.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: CKI Project <cki-project@redhat.com>,
	linux-kernel@vger.kernel.org, nathan@kernel.org,
	Memory Management <mm-qe@redhat.com>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
Date: Thu, 29 Jul 2021 14:36:56 +0200	[thread overview]
Message-ID: <CA+QYu4rCRR_pNQVxSwGpzcLWJKLUA3F7LAtBiU9hPzz3D0k0Cg@mail.gmail.com> (raw)
In-Reply-To: <d86333e7-bcde-2adb-e566-21ec97cda8bf@arm.com>

On Wed, Jul 28, 2021 at 5:55 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 28/07/2021 15:11, Bruno Goncalves wrote:
> > Hello,
> >
> > Since this commit (Commit: 45312bd762d3 - Merge tag 'zonefs-5.14-rc2')
> > we started to see the following call trace, it seems to be
> > reproducible only on aarch64.
>
> It should happen on platforms using a slow-switching cpufreq driver.
>
> Only in this case you have n (depends on nbr of frequency domains)
> special-purpose DL threads when using schedutil CPUFreq governor:
>
> root@juno: ps -eTo comm,pid,pri,class | grep sugov
>
> sugov:0           132 140 DLN
> sugov:1           134 140 DLN
>
> >
> > [  384.485614] ------------[ cut here ]------------
> > [  384.490227] rq->clock_update_flags < RQCF_ACT_SKIP
> > [  384.490232] WARNING: CPU: 112 PID: 2041 at
> > kernel/sched/sched.h:1453 sub_running_bw.isra.0+0x190/0x1a0
> > [  384.504312] Modules linked in: mlx5_ib ib_uverbs ib_core rfkill
> > sunrpc acpi_ipmi ipmi_ssif mlx5_core mlxfw psample ipmi_devintf
> > arm_cmn ipmi_msghandler arm_dsu_pmu cppc_cpufreq acpi_tad vfat fat
> > fuse zram ip_tables x_tables xfs crct10dif_ce ghash_ce ast
> > i2c_algo_bit drm_vram_helper sbsa_gwdt drm_kms_helper syscopyarea
> > sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm nvme
> > nvme_core drm xgene_hwmon aes_neon_bs
> > [  384.541165] CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G        W
> >        5.14.0-rc1 #1
> > [  384.549244] Hardware name: WIWYNN Mt.Jade Server System
> > B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
> > 1.06.20210526) 2021/05/26
> > [  384.561922] pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO BTYPE=--)
> > [  384.567918] pc : sub_running_bw.isra.0+0x190/0x1a0
> > [  384.572698] lr : sub_running_bw.isra.0+0x190/0x1a0
> > [  384.577477] sp : ffff800024c4bb20
> > [  384.580779] x29: ffff800024c4bb20 x28: 0000000000000000 x27: ffffb9a9bbe1d200
> > [  384.587904] x26: 0000000000000074 x25: 0000000000000011 x24: ffffb9a9bdff9000
> > [  384.595029] x23: ffff07ffb36fcaa0 x22: ffff401ee09b65c0 x21: ffffb9a9bbe1de00
> > [  384.602153] x20: ffff401ee09a3360 x19: ffff401ee09b6f58 x18: 0000000000000000
> > [  384.609277] x17: ffff867522f0c000 x16: ffff800010384000 x15: 0000000000000030
> > [  384.616401] x14: 0000000000000000 x13: 50494b535f544341 x12: 5f46435152203c20
> > [  384.623526] x11: ffff401ee04b0ea8 x10: ffff401ee021e068 x9 : ffffb9a9bbe4214c
> > [  384.630650] x8 : 0000000000010ea8 x7 : ffff401ee01e0000 x6 : 0000000000017ffd
> > [  384.637774] x5 : ffff401ee09a3490 x4 : 0000000000000001 x3 : ffff867522f0c000
> > [  384.644898] x2 : ffff401ee09a3498 x1 : ffff07ffb53cc000 x0 : 0000000000000026
> > [  384.652022] Call trace:
> > [  384.654457]  sub_running_bw.isra.0+0x190/0x1a0
> > [  384.658890]  migrate_task_rq_dl+0xf8/0x1e0
> > [  384.662975]  set_task_cpu+0xa8/0x1f0
> > [  384.666540]  try_to_wake_up+0x150/0x3d4
> > [  384.670365]  wake_up_q+0x64/0xc0
> > [  384.673582]  __up_write+0xd0/0x1c0
> > [  384.676974]  up_write+0x4c/0x2b0
> > [  384.680191]  cppc_set_perf+0x120/0x2d0
> > [  384.683931]  cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
> > [  384.689756]  __cpufreq_driver_target+0x74/0x140
> > [  384.694277]  sugov_work+0x64/0x80
> > [  384.697580]  kthread_worker_fn+0xe0/0x230
> > [  384.701580]  kthread+0x138/0x140
> > [  384.704797]  ret_from_fork+0x10/0x18
>
> Don't quite get this.
> `sugov:112` should be a special DL entity (dl_se->flags &
> SCHED_FLAG_SUGOV) so sub_running_bw() should not call __sub_running_bw()
> and hence there won't be a call to cpufreq_update_util() which calls
> q_clock(rq) -> assert_clock_updated()?
>
> Can't reproduce it on my Juno (arm64) (slow-switching (scpi-cpufreq
> driver)).

We seem to be able to reproduce this only on Ampere Altra machines,
specifically on mtjade and mtsnow cpus.

# cpupower frequency-info
analyzing CPU 0:
  driver: cppc_cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1000 MHz - 2.80 GHz
  available cpufreq governors: conservative ondemand userspace
powersave performance schedutil
  current policy: frequency should be within 2.00 GHz and 2.80 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: 1.55 GHz (asserted by call to hardware)

# ps -eTo comm,pid,pri,class | grep sugov
sugov:0            1082 140 DLN
sugov:1            1085 140 DLN
...
sugov:78           1319 140 DLN
sugov:79           1320 140 DLN


Bruno


>


WARNING: multiple messages have this Message-ID (diff)
From: Bruno Goncalves <bgoncalv@redhat.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: CKI Project <cki-project@redhat.com>,
	linux-kernel@vger.kernel.org, nathan@kernel.org,
	 Memory Management <mm-qe@redhat.com>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
Date: Thu, 29 Jul 2021 14:36:56 +0200	[thread overview]
Message-ID: <CA+QYu4rCRR_pNQVxSwGpzcLWJKLUA3F7LAtBiU9hPzz3D0k0Cg@mail.gmail.com> (raw)
In-Reply-To: <d86333e7-bcde-2adb-e566-21ec97cda8bf@arm.com>

On Wed, Jul 28, 2021 at 5:55 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 28/07/2021 15:11, Bruno Goncalves wrote:
> > Hello,
> >
> > Since this commit (Commit: 45312bd762d3 - Merge tag 'zonefs-5.14-rc2')
> > we started to see the following call trace, it seems to be
> > reproducible only on aarch64.
>
> It should happen on platforms using a slow-switching cpufreq driver.
>
> Only in this case you have n (depends on nbr of frequency domains)
> special-purpose DL threads when using schedutil CPUFreq governor:
>
> root@juno: ps -eTo comm,pid,pri,class | grep sugov
>
> sugov:0           132 140 DLN
> sugov:1           134 140 DLN
>
> >
> > [  384.485614] ------------[ cut here ]------------
> > [  384.490227] rq->clock_update_flags < RQCF_ACT_SKIP
> > [  384.490232] WARNING: CPU: 112 PID: 2041 at
> > kernel/sched/sched.h:1453 sub_running_bw.isra.0+0x190/0x1a0
> > [  384.504312] Modules linked in: mlx5_ib ib_uverbs ib_core rfkill
> > sunrpc acpi_ipmi ipmi_ssif mlx5_core mlxfw psample ipmi_devintf
> > arm_cmn ipmi_msghandler arm_dsu_pmu cppc_cpufreq acpi_tad vfat fat
> > fuse zram ip_tables x_tables xfs crct10dif_ce ghash_ce ast
> > i2c_algo_bit drm_vram_helper sbsa_gwdt drm_kms_helper syscopyarea
> > sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm nvme
> > nvme_core drm xgene_hwmon aes_neon_bs
> > [  384.541165] CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G        W
> >        5.14.0-rc1 #1
> > [  384.549244] Hardware name: WIWYNN Mt.Jade Server System
> > B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
> > 1.06.20210526) 2021/05/26
> > [  384.561922] pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO BTYPE=--)
> > [  384.567918] pc : sub_running_bw.isra.0+0x190/0x1a0
> > [  384.572698] lr : sub_running_bw.isra.0+0x190/0x1a0
> > [  384.577477] sp : ffff800024c4bb20
> > [  384.580779] x29: ffff800024c4bb20 x28: 0000000000000000 x27: ffffb9a9bbe1d200
> > [  384.587904] x26: 0000000000000074 x25: 0000000000000011 x24: ffffb9a9bdff9000
> > [  384.595029] x23: ffff07ffb36fcaa0 x22: ffff401ee09b65c0 x21: ffffb9a9bbe1de00
> > [  384.602153] x20: ffff401ee09a3360 x19: ffff401ee09b6f58 x18: 0000000000000000
> > [  384.609277] x17: ffff867522f0c000 x16: ffff800010384000 x15: 0000000000000030
> > [  384.616401] x14: 0000000000000000 x13: 50494b535f544341 x12: 5f46435152203c20
> > [  384.623526] x11: ffff401ee04b0ea8 x10: ffff401ee021e068 x9 : ffffb9a9bbe4214c
> > [  384.630650] x8 : 0000000000010ea8 x7 : ffff401ee01e0000 x6 : 0000000000017ffd
> > [  384.637774] x5 : ffff401ee09a3490 x4 : 0000000000000001 x3 : ffff867522f0c000
> > [  384.644898] x2 : ffff401ee09a3498 x1 : ffff07ffb53cc000 x0 : 0000000000000026
> > [  384.652022] Call trace:
> > [  384.654457]  sub_running_bw.isra.0+0x190/0x1a0
> > [  384.658890]  migrate_task_rq_dl+0xf8/0x1e0
> > [  384.662975]  set_task_cpu+0xa8/0x1f0
> > [  384.666540]  try_to_wake_up+0x150/0x3d4
> > [  384.670365]  wake_up_q+0x64/0xc0
> > [  384.673582]  __up_write+0xd0/0x1c0
> > [  384.676974]  up_write+0x4c/0x2b0
> > [  384.680191]  cppc_set_perf+0x120/0x2d0
> > [  384.683931]  cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
> > [  384.689756]  __cpufreq_driver_target+0x74/0x140
> > [  384.694277]  sugov_work+0x64/0x80
> > [  384.697580]  kthread_worker_fn+0xe0/0x230
> > [  384.701580]  kthread+0x138/0x140
> > [  384.704797]  ret_from_fork+0x10/0x18
>
> Don't quite get this.
> `sugov:112` should be a special DL entity (dl_se->flags &
> SCHED_FLAG_SUGOV) so sub_running_bw() should not call __sub_running_bw()
> and hence there won't be a call to cpufreq_update_util() which calls
> q_clock(rq) -> assert_clock_updated()?
>
> Can't reproduce it on my Juno (arm64) (slow-switching (scpi-cpufreq
> driver)).

We seem to be able to reproduce this only on Ampere Altra machines,
specifically on mtjade and mtsnow cpus.

# cpupower frequency-info
analyzing CPU 0:
  driver: cppc_cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1000 MHz - 2.80 GHz
  available cpufreq governors: conservative ondemand userspace
powersave performance schedutil
  current policy: frequency should be within 2.00 GHz and 2.80 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: 1.55 GHz (asserted by call to hardware)

# ps -eTo comm,pid,pri,class | grep sugov
sugov:0            1082 140 DLN
sugov:1            1085 140 DLN
...
sugov:78           1319 140 DLN
sugov:79           1320 140 DLN


Bruno


>


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2021-07-29 12:37 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-28 13:11 WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453 Bruno Goncalves
2021-07-28 15:55 ` Dietmar Eggemann
2021-07-29 12:36   ` Bruno Goncalves [this message]
2021-07-29 12:36     ` Bruno Goncalves
2021-07-29 14:38     ` Dietmar Eggemann
2021-07-29 14:38       ` Dietmar Eggemann
2021-07-30 12:22       ` Dietmar Eggemann
2021-07-30 12:22         ` Dietmar Eggemann
2021-07-30 15:23         ` Bruno Goncalves
2021-07-30 15:23           ` Bruno Goncalves
2021-08-02  8:43           ` Dietmar Eggemann
2021-08-02  8:43             ` Dietmar Eggemann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+QYu4rCRR_pNQVxSwGpzcLWJKLUA3F7LAtBiU9hPzz3D0k0Cg@mail.gmail.com \
    --to=bgoncalv@redhat.com \
    --cc=cki-project@redhat.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mm-qe@redhat.com \
    --cc=nathan@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.