* Long standing kernel warning: perfevents: irq loop stuck!
@ 2018-02-23  4:59 Cong Wang
  2018-02-23 12:14 ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2018-02-23  4:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen, Liang, Kan, jolsa, bigeasy,
	H. Peter Anvin, Ingo Molnar
  Cc: Thomas Gleixner, x86, LKML

Hello,

We keep seeing the following kernel warning on kernels from 3.10 to 4.9;
it has existed for a rather long time.

Google search shows there was a patch from Ingo:
https://patchwork.kernel.org/patch/6308681/

but it doesn't look like it was ever merged into mainline...

I don't know how it is triggered. Please let me know if there is any
other information I can provide.

BTW, the 4.9.78 kernel we use is based on the upstream 4.9 release,
plus some fs and networking patches backported; everything is from
upstream.


Thanks!

----------->

[12032.813743] perf: interrupt took too long (7710 > 7696), lowering kernel.perf_event_max_sample_rate to 25000
[14751.091121] perfevents: irq loop stuck!
[14751.095169] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 4.099 msecs
[14751.103265] perf: interrupt took too long (40100 > 9637), lowering kernel.perf_event_max_sample_rate to 4000
[14751.113092] ------------[ cut here ]------------
[14751.117719] WARNING: CPU: 34 PID: 85204 at arch/x86/events/intel/core.c:2093 intel_pmu_handle_irq+0x35d/0x4c0
[14751.127629] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth fuse sch_ingress iTCO_wdt intel_rapl sb_edac edac_core iTCO_vendor_support x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_i801 i2c_smbus ioatdma i2c_core lpc_ich shpchp tcp_diag hed inet_diag wmi acpi_pad ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c ixgbe mdio ptp crc32c_intel pps_core dca
[14751.172819] CPU: 34 PID: 85204 Comm: kworker/34:2 Not tainted 4.9.78.x86_64 #1
[14751.181341] Hardware name: SYNNEX F3HY-MX/X10DRD-LTP-B-TW008, BIOS 2.0 10/14/2016
[14751.188829]  ffff99577fa88b48 ffffffff8138d5e7 ffff99577fa88b98 0000000000000000
[14751.196922]  ffff99577fa88b88 ffffffff8108a7fb 0000082d00000000 0000000000000064
[14751.205015]  0000000200000000 ffff99577fa8d440 ffff993902a16000 0000000000000040
[14751.213102] Call Trace:
[14751.215564]  <NMI>  [<ffffffff8138d5e7>] dump_stack+0x4d/0x66
[14751.221321]  [<ffffffff8108a7fb>] __warn+0xcb/0xf0
[14751.226124]  [<ffffffff8108a87f>] warn_slowpath_fmt+0x5f/0x80
[14751.231880]  [<ffffffff8100bc2d>] intel_pmu_handle_irq+0x35d/0x4c0
[14751.238062]  [<ffffffff810047dc>] perf_event_nmi_handler+0x2c/0x50
[14751.244248]  [<ffffffff81021eda>] nmi_handle+0x6a/0x120
[14751.249484]  [<ffffffff81022443>] default_do_nmi+0x53/0xf0
[14751.254992]  [<ffffffff810225c0>] do_nmi+0xe0/0x120
[14751.259884]  [<ffffffff8175535d>] end_repeat_nmi+0x87/0x8f
[14751.265377]  [<ffffffff8100b811>] ? intel_pmu_enable_event+0x1d1/0x230
[14751.271913]  [<ffffffff8100b811>] ? intel_pmu_enable_event+0x1d1/0x230
[14751.278446]  [<ffffffff8100b811>] ? intel_pmu_enable_event+0x1d1/0x230
[14751.284981]  <EOE>  [<ffffffff81005c6e>] x86_pmu_start+0x7e/0x100
[14751.291082]  [<ffffffff81005f62>] x86_pmu_enable+0x272/0x2e0
[14751.296754]  [<ffffffff811803b7>] perf_pmu_enable.part.92+0x7/0x10
[14751.302946]  [<ffffffff811854ab>] perf_cgroup_switch+0x17b/0x1b0
[14751.308963]  [<ffffffff81186636>] __perf_event_task_sched_in+0x66/0x1a0
[14751.315582]  [<ffffffff81186f11>] ? __perf_event_task_sched_out+0xb1/0x430
[14751.322463]  [<ffffffff810b1d7a>] finish_task_switch+0x10a/0x1b0
[14751.328476]  [<ffffffff8174edbd>] __schedule+0x20d/0x690
[14751.333797]  [<ffffffff8174f276>] schedule+0x36/0x80
[14751.338763]  [<ffffffff810a505e>] worker_thread+0xbe/0x480
[14751.344251]  [<ffffffff810a4fa0>] ? process_one_work+0x410/0x410
[14751.350265]  [<ffffffff810aa8e6>] kthread+0xe6/0x100
[14751.355238]  [<ffffffff8108f188>] ? do_exit+0x698/0xaa0
[14751.360475]  [<ffffffff810aa800>] ? kthread_park+0x60/0x60
[14751.365966]  [<ffffffff81754194>] ret_from_fork+0x54/0x60
[14751.371376] ---[ end trace fd59d29a318e02d5 ]---

[14751.377511] CPU#34: ctrl:       0000000000000000
[14751.382141] CPU#34: status:     0000000000000000
[14751.386770] CPU#34: overflow:   0000000000000000
[14751.391395] CPU#34: fixed:      00000000000000b0
[14751.396022] CPU#34: pebs:       0000000000000000
[14751.400648] CPU#34: debugctl:   0000000000000000
[14751.405281] CPU#34: active:     0000000200000000
[14751.409912] CPU#34:   gen-PMC0 ctrl:  00000000001301b7
[14751.415064] CPU#34:   gen-PMC0 count: 0000ffff0025fa88
[14751.420214] CPU#34:   gen-PMC0 left:  00000000ffda057b
[14751.425358] CPU#34:   gen-PMC1 ctrl:  00000000001301bb
[14751.430497] CPU#34:   gen-PMC1 count: 0000ffff005ad046
[14751.435643] CPU#34:   gen-PMC1 left:  00000000ffa52fc1
[14751.440786] CPU#34:   gen-PMC2 ctrl:  0000000000130151
[14751.445937] CPU#34:   gen-PMC2 count: 0000ffff069ffd2d
[14751.451091] CPU#34:   gen-PMC2 left:  00000000f9600409
[14751.456240] CPU#34:   gen-PMC3 ctrl:  000000000013003c
[14751.461383] CPU#34:   gen-PMC3 count: 0000ffff05abd0c9
[14751.466524] CPU#34:   gen-PMC3 left:  00000000fa54a75b
[14751.471670] CPU#34: fixed-PMC0 count: 0000ffffd26bbae7
[14751.476814] CPU#34: fixed-PMC1 count: 0000ffffffffffff
[14751.481958] CPU#34: fixed-PMC2 count: 0000000000000000
[14751.487100] core: clearing PMU state on CPU#34
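
One detail worth decoding in the dump: Intel counters are armed with
the negated sampling period, so fixed-PMC1 (unhalted core cycles, and
seemingly the only counter marked active above, bit 33 of the active
mask) shows a raw count of 0000ffffffffffff, one event short of
overflow. It re-overflows as fast as the NMI handler can clear it,
which matches the "irq loop stuck" pattern. A minimal standalone
decode, assuming the usual 48-bit counter width:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t count = 0x0000ffffffffffffULL; /* fixed-PMC1 from the dump */
	uint64_t mask  = (1ULL << 48) - 1;      /* assumed 48-bit counter width */
	uint64_t left  = (0 - count) & mask;    /* events left until overflow */

	printf("events until overflow: %llu\n", (unsigned long long)left);
	return 0; /* prints 1 */
}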

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2018-02-23  4:59 Long standing kernel warning: perfevents: irq loop stuck! Cong Wang
@ 2018-02-23 12:14 ` Peter Zijlstra
  2018-02-26 20:32   ` Cong Wang
  2018-02-26 20:39   ` Andi Kleen
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2018-02-23 12:14 UTC (permalink / raw)
  To: Cong Wang
  Cc: Andi Kleen, Liang, Kan, jolsa, bigeasy, H. Peter Anvin,
	Ingo Molnar, Thomas Gleixner, x86, LKML

On Thu, Feb 22, 2018 at 08:59:47PM -0800, Cong Wang wrote:
> Hello,
> 
> We keep seeing the following kernel warning on kernels from 3.10 to 4.9;
> it has existed for a rather long time.
> 
> Google search shows there was a patch from Ingo:
> https://patchwork.kernel.org/patch/6308681/
> 
> but it doesn't look like it was ever merged into mainline...
> 
> I don't know how it is triggered. Please let me know if there is any
> other information I can provide.

What exact workload are you using to reproduce?

And I take it that the patch 'works' for you?

Given the HSD143 errata and its possible relevance, have you tried
changing the magic number to 32, does it then still fix things?

No real objection to the patch as such, it just needs a coherent comment
and a tested-by tag I think.
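
For reference, the core of that patch is a minimum-period clamp along
these lines. This is only a sketch, modeled on the kernel's existing
limit_period hooks (cf. bdw_limit_period); the exact hunk and naming
are in the patchwork link above:

static u64 hsw_limit_period(struct perf_event *event, u64 left)
{
	/*
	 * Clamp the minimum sampling period; 128 is the "magic
	 * number" under discussion, cf. the HSD143 erratum.
	 */
	return max(left, 128ULL);
}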

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2018-02-23 12:14 ` Peter Zijlstra
@ 2018-02-26 20:32   ` Cong Wang
  2018-02-26 20:39   ` Andi Kleen
  1 sibling, 0 replies; 12+ messages in thread
From: Cong Wang @ 2018-02-26 20:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Liang, Kan, jolsa, bigeasy, H. Peter Anvin,
	Ingo Molnar, Thomas Gleixner, x86, LKML

On Fri, Feb 23, 2018 at 4:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 22, 2018 at 08:59:47PM -0800, Cong Wang wrote:
>> Hello,
>>
>> We keep seeing the following kernel warning on kernels from 3.10 to 4.9;
>> it has existed for a rather long time.
>>
>> Google search shows there was a patch from Ingo:
>> https://patchwork.kernel.org/patch/6308681/
>>
>> but it doesn't look like it was ever merged into mainline...
>>
>> I don't know how it is triggered. Please let me know if there is any
>> other information I can provide.
>
> What exact workload are you using to reproduce?

I have no idea how to reproduce it. It has been reported so many times
from so many different machines via ABRT.


>
> And I take it that the patch 'works' for you?

I haven't tried it yet, because according to Ingo himself, that patch
is not complete:

"
Also, I'd apply the quirk not just to Haswell, but Nehalem, Westmere
and Ivy Bridge as well, I have seen it as early as on a Nehalem
prototype box.
"

I can try it if that patch makes sense to you and if you can make it
complete. ;)


>
> Given the HSD143 errata and its possible relevance, have you tried
> changing the magic number to 32, does it then still fix things?
>
> No real objection to the patch as such, it just needs a coherent comment
> and a tested-by tag I think.

I will give it a try. Please let me know if you have an updated
version of that patch that applies to a recent kernel (4.9), since it
was written almost 3 years ago; otherwise I can port it manually.

It will take some time due to our deployment process for a new kernel.

Thanks!

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2018-02-23 12:14 ` Peter Zijlstra
  2018-02-26 20:32   ` Cong Wang
@ 2018-02-26 20:39   ` Andi Kleen
  2019-08-12 17:24     ` Josh Hunt
  1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2018-02-26 20:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Cong Wang, Liang, Kan, jolsa, bigeasy, H. Peter Anvin,
	Ingo Molnar, Thomas Gleixner, x86, LKML

> Given the HSD143 errata and its possible relevance, have you tried
> changing the magic number to 32, does it then still fix things?
> 
> No real objection to the patch as such, it just needs a coherent comment
> and a tested-by tag I think.

A 128 minimum period will affect a lot of valid use cases with
slower-ticking events.  I often use smaller periods there.

It would be better to debug this properly.

Or, at a minimum, only apply the limit to events that tick really
fast (like cycles, uops retired, etc.).
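
As an untested sketch of that idea (the function name and the
event-select match are purely illustrative, not from any posted
patch):

static u64 fast_event_limit_period(struct perf_event *event, u64 left)
{
	u64 evsel = event->hw.config & ARCH_PERFMON_EVENTSEL_EVENT;

	/* Only clamp fast-ticking events, e.g. 0x3c == unhalted core cycles. */
	if (evsel == 0x3c)
		return max(left, 128ULL);
	return left;
}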

-Andi

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2018-02-26 20:39   ` Andi Kleen
@ 2019-08-12 17:24     ` Josh Hunt
  2019-08-12 17:54       ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Hunt @ 2019-08-12 17:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Cong Wang, Liang, Kan, jolsa, bigeasy,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, x86, LKML

On Mon, Feb 26, 2018 at 12:40 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Given the HSD143 errata and its possible relevance, have you tried
> > changing the magic number to 32, does it then still fix things?
> >
> > No real objection to the patch as such, it just needs a coherent comment
> > and a tested-by tag I think.
>
> A 128 minimum period will affect a lot of valid use cases with
> slower-ticking events.  I often use smaller periods there.
>
> It would be better to debug this properly.
>
> Or, at a minimum, only apply the limit to events that tick really
> fast (like cycles, uops retired, etc.).
>
> -Andi

Was there any progress made on debugging this issue? We are still
seeing it on 4.19.44:

[ 2660.685392] ------------[ cut here ]------------
[ 2660.685392] perfevents: irq loop stuck!
[ 2660.685392] WARNING: CPU: 1 PID: 4436 at arch/x86/events/intel/core.c:2278 intel_pmu_handle_irq+0x37b/0x530
[ 2660.685393] Modules linked in: sch_fq ip6table_raw ip6table_filter ip6_tables iptable_raw xt_TARPIT ts_bm xt_u32 xt_recent xt_string xt_set ip_set_hash_ip ip_set_hash_ipportip ip_set_hash_net dev_cstack tcp_bbr tcp_qdk netconsole aep ip6_udp_tunnel udp_tunnel dm_mod i2c_dev tcp_fast w83627ehf hwmon_vid jc42 i2c_core softdog autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 linear md_mod ext4 crc32c_generic crc16 mbcache jbd2 xt_tcpudp ipv6 iptable_filter ip_tables ip_set nfnetlink x_tables zfs(O) zunicode(O) zavl(O) icp(O) sr_mod zcommon(O) znvpair(O) spl(O) coretemp hwmon kvm_intel ipmi_devintf kvm ata_piix irqbypass crc32c_intel ipmi_msghandler i7core_edac libata lpc_ich mfd_core e1000e pcc_cpufreq
[ 2660.685405] CPU: 1 PID: 4436 Comm: xx_yyyy01 Tainted: G           O      4.19.44 #1
[ 2660.685405] Hardware name: Ciara Technologies
[ 2660.685405] RIP: 0010:intel_pmu_handle_irq+0x37b/0x530
[ 2660.685406] Code: 00 00 bf 40 03 00 00 48 8b 40 10 e8 bf 82 9f 00 e9 f3 fc ff ff 80 3d f3 54 61 01 00 75 1a 48 c7 c7 02 00 e1 93 e8 45 8d 06 00 <0f> 0b e8 2e a9 ff ff c6 05 d7 54 61 01 01 65 4c 8b 35 5f 4f 00 6d
[ 2660.685406] RSP: 0018:fffffe0000034c40 EFLAGS: 00010086
[ 2660.685407] RAX: 000000000000001c RBX: 0000000000000064 RCX: 0000000000000002
[ 2660.685407] RDX: 0000000000000003 RSI: ffffffff93e1001e RDI: ffff926bafa555a8
[ 2660.685407] RBP: fffffe0000034e30 R08: fffffff8919fe17a R09: 0000000000000000
[ 2660.685407] R10: fffffe0000034c40 R11: 0000000000000000 R12: ffff926bafa4f3a0
[ 2660.685408] R13: ffff926b93739000 R14: 0000000000000040 R15: ffff926bafa4f5a0
[ 2660.685408] FS:  00007f0b9ccd7940(0000) GS:ffff926bafa40000(0000) knlGS:0000000000000000
[ 2660.685408] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2660.685409] CR2: 00007f45c3497750 CR3: 000000042388e000 CR4: 00000000000007e0
[ 2660.685409] Call Trace:
[ 2660.685409]  <NMI>
[ 2660.685409]  ? perf_event_nmi_handler+0x2e/0x50
[ 2660.685409]  ? intel_pmu_save_and_restart+0x50/0x50
[ 2660.685410]  perf_event_nmi_handler+0x2e/0x50
[ 2660.685410]  nmi_handle+0x6e/0x120
[ 2660.685410]  default_do_nmi+0x3e/0x100
[ 2660.685410]  do_nmi+0x102/0x160
[ 2660.685410]  end_repeat_nmi+0x16/0x50
[ 2660.685411] RIP: 0010:native_write_msr+0x6/0x20
[ 2660.685411] Code: c3 48 c1 e2 20 48 89 d3 8b 16 48 09 c3 48 89 de e8 bf 53 3b 00 48 89 d8 5b c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 89 f0 0f 30 <66> 66 66 66 90 c3 48 c1 e2 20 89 f6 48 09 d6 31 d2 e9 24 53 3b 00
[ 2660.685411] RSP: 0018:ffffb04661fb7c60 EFLAGS: 00000046
[ 2660.685412] RAX: 0000000000000bb0 RBX: ffff926b93739000 RCX: 000000000000038d
[ 2660.685412] RDX: 0000000000000000 RSI: 0000000000000bb0 RDI: 000000000000038d
[ 2660.685412] RBP: 000000000000000b R08: fffffff8919fe17a R09: ffffffff941602d0
[ 2660.685413] R10: ffffb04661fb7bd0 R11: 0000000000000362 R12: 0000000000000008
[ 2660.685413] R13: 0000000000000001 R14: ffff926bafa4f5c4 R15: 0000000000000001
[ 2660.685413]  ? native_write_msr+0x6/0x20
[ 2660.685413]  ? native_write_msr+0x6/0x20
[ 2660.685414]  </NMI>
[ 2660.685414]  intel_pmu_enable_event+0x1ce/0x1f0
[ 2660.685414]  x86_pmu_start+0x78/0xa0
[ 2660.685414]  x86_pmu_enable+0x252/0x310
[ 2660.685414]  __perf_event_task_sched_in+0x181/0x190
[ 2660.685415]  ? __switch_to_asm+0x34/0x70
[ 2660.685415]  ? __switch_to_asm+0x40/0x70
[ 2660.685415]  ? __switch_to_asm+0x34/0x70
[ 2660.685415]  ? __switch_to_asm+0x40/0x70
[ 2660.685416]  finish_task_switch+0x158/0x260
[ 2660.685416]  __schedule+0x2f6/0x840
[ 2660.685416]  ? hrtimer_start_range_ns+0x153/0x210
[ 2660.685416]  schedule+0x32/0x80
[ 2660.685417]  schedule_hrtimeout_range_clock+0x8a/0x100
[ 2660.685417]  ? hrtimer_init+0x120/0x120
[ 2660.685417]  ep_poll+0x2f7/0x3a0
[ 2660.685417]  ? wake_up_q+0x60/0x60
[ 2660.685417]  do_epoll_wait+0xa9/0xc0
[ 2660.685418]  __x64_sys_epoll_wait+0x1a/0x20
[ 2660.685418]  do_syscall_64+0x4e/0x110
[ 2660.685418]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2660.685418] RIP: 0033:0x7f4d35107c03
[ 2660.685419] Code: 49 89 ca b8 e8 00 00 00 0f 05 48 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 cb d6 00 00 48 89 04 24 49 89 ca b8 e8 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 11 d7 00 00 48 89 d0 48 83 c4 08 48 3d 01
[ 2660.685419] RSP: 002b:00007f0f6d11dd60 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[ 2660.685420] RAX: ffffffffffffffda RBX: 000000001a3a3410 RCX: 00007f4d35107c03
[ 2660.685420] RDX: 00000000000003e8 RSI: 000000001a3a37b8 RDI: 0000000000000124
[ 2660.685420] RBP: 000000001a3a37b8 R08: 00007f4d36a86000 R09: 0000000000000000
[ 2660.685421] R10: 0000000000000004 R11: 0000000000000293 R12: 0000000000000004
[ 2660.685421] R13: 0000000000000000 R14: 000000001a3a3410 R15: 0000000000000004
[ 2660.685421] ---[ end trace 0e6128739ea4836a ]---

[ 2660.685421] CPU#1: ctrl:       0000000000000000
[ 2660.685422] CPU#1: status:     0000000400000000
[ 2660.685422] CPU#1: overflow:   0000000000000000
[ 2660.685422] CPU#1: fixed:      0000000000000bb0
[ 2660.685422] CPU#1: pebs:       0000000000000000
[ 2660.685422] CPU#1: debugctl:   0000000000000000
[ 2660.685423] CPU#1: active:     0000000600000000
[ 2660.685423] CPU#1:   gen-PMC0 ctrl:  0000000000000000
[ 2660.685423] CPU#1:   gen-PMC0 count: 0000000000000000
[ 2660.685423] CPU#1:   gen-PMC0 left:  0000000000000000
[ 2660.685424] CPU#1:   gen-PMC1 ctrl:  0000000000000000
[ 2660.685424] CPU#1:   gen-PMC1 count: 0000000000000000
[ 2660.685424] CPU#1:   gen-PMC1 left:  0000000000000000
[ 2660.685424] CPU#1:   gen-PMC2 ctrl:  0000000000000000
[ 2660.685425] CPU#1:   gen-PMC2 count: 0000000000000000
[ 2660.685425] CPU#1:   gen-PMC2 left:  0000000000000000
[ 2660.685425] CPU#1:   gen-PMC3 ctrl:  0000000000000000
[ 2660.685425] CPU#1:   gen-PMC3 count: 0000000000000000
[ 2660.685425] CPU#1:   gen-PMC3 left:  0000000000000000
[ 2660.685426] CPU#1: fixed-PMC0 count: 0000000000000000
[ 2660.685426] CPU#1: fixed-PMC1 count: 0000ffffeef5464e
[ 2660.685426] CPU#1: fixed-PMC2 count: 0000fffffffffff8
[ 2660.685426] core: clearing PMU state on CPU#1
[ 4700.443984] core: clearing PMU state on CPU#6

It does not reliably reproduce, but we have seen it in our lab. I'd be
happy to help debug, but need some guidance.
-- 
Josh

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-12 17:24     ` Josh Hunt
@ 2019-08-12 17:54       ` Thomas Gleixner
  2019-08-12 18:57         ` Josh Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2019-08-12 17:54 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, 12 Aug 2019, Josh Hunt wrote:
> Was there any progress made on debugging this issue? We are still
> seeing it on 4.19.44:

I haven't seen anyone looking at this.

Can you please try the patch Ingo posted:

  https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/

and if it fixes the issue decrease the value from 128 to the point where it
comes back, i.e. 128 -> 64 -> 32 ...

Thanks,

	tglx

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-12 17:54       ` Thomas Gleixner
@ 2019-08-12 18:57         ` Josh Hunt
  2019-08-12 19:34           ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Hunt @ 2019-08-12 18:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, 12 Aug 2019, Josh Hunt wrote:
> > Was there any progress made on debugging this issue? We are still
> > seeing it on 4.19.44:
>
> I haven't seen anyone looking at this.
>
> Can you please try the patch Ingo posted:
>
>   https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/
>
> and if it fixes the issue decrease the value from 128 to the point where it
> comes back, i.e. 128 -> 64 -> 32 ...
>
> Thanks,
>
>         tglx

I just checked the machines where this problem occurs and they're both
Nehalem boxes. I think Ingo's patch would only help Haswell machines.
Please let me know if I misread the patch or if what I'm seeing is a
different issue than the one Cong originally reported.

Thanks
-- 
Josh

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-12 18:57         ` Josh Hunt
@ 2019-08-12 19:34           ` Thomas Gleixner
  2019-08-12 19:42             ` Josh Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2019-08-12 19:34 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, 12 Aug 2019, Josh Hunt wrote:
> On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > Was there any progress made on debugging this issue? We are still
> > > seeing it on 4.19.44:
> >
> > I haven't seen anyone looking at this.
> >
> > Can you please try the patch Ingo posted:
> >
> >   https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/
> >
> > and if it fixes the issue decrease the value from 128 to the point where it
> > comes back, i.e. 128 -> 64 -> 32 ...
> >
> > Thanks,
> >
> >         tglx
> 
> I just checked the machines where this problem occurs and they're both
> Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> Please let me know if I misread the patch or if what I'm seeing is a
> different issue than the one Cong originally reported.

Find the NHM hack below.

Thanks,

	tglx
	
8<----------------

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 648260b5f367..93c1a4f0e73e 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left)
 	return left;
 }
 
+static u64 nhm_limit_period(struct perf_event *event, u64 left)
+{
+	return max(left, 128ULL);
+}
+
 PMU_FORMAT_ATTR(event,	"config:0-7"	);
 PMU_FORMAT_ATTR(umask,	"config:8-15"	);
 PMU_FORMAT_ATTR(edge,	"config:18"	);
@@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
 		x86_pmu.enable_all = intel_pmu_nhm_enable_all;
 		x86_pmu.extra_regs = intel_nehalem_extra_regs;
+		x86_pmu.limit_period = nhm_limit_period;
 
 		mem_attr = nhm_mem_events_attrs;
 

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-12 19:34           ` Thomas Gleixner
@ 2019-08-12 19:42             ` Josh Hunt
  2019-08-19 21:17               ` Josh Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Hunt @ 2019-08-12 19:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, 12 Aug 2019, Josh Hunt wrote:
> > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > Was there any progress made on debugging this issue? We are still
> > > > seeing it on 4.19.44:
> > >
> > > I haven't seen anyone looking at this.
> > >
> > > Can you please try the patch Ingo posted:
> > >
> > >   https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/
> > >
> > > and if it fixes the issue decrease the value from 128 to the point where it
> > > comes back, i.e. 128 -> 64 -> 32 ...
> > >
> > > Thanks,
> > >
> > >         tglx
> >
> > I just checked the machines where this problem occurs and they're both
> > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > Please let me know if I misread the patch or if what I'm seeing is a
> > different issue than the one Cong originally reported.
>
> Find the NHM hack below.
>
> Thanks,
>
>         tglx
>
> 8<----------------
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 648260b5f367..93c1a4f0e73e 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left)
>         return left;
>  }
>
> +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> +{
> +       return max(left, 128ULL);
> +}
> +
>  PMU_FORMAT_ATTR(event, "config:0-7"    );
>  PMU_FORMAT_ATTR(umask, "config:8-15"   );
>  PMU_FORMAT_ATTR(edge,  "config:18"     );
> @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
>                 x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
>                 x86_pmu.enable_all = intel_pmu_nhm_enable_all;
>                 x86_pmu.extra_regs = intel_nehalem_extra_regs;
> +               x86_pmu.limit_period = nhm_limit_period;
>
>                 mem_attr = nhm_mem_events_attrs;
>
Thanks Thomas. Will try this and let you know.

-- 
Josh

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-12 19:42             ` Josh Hunt
@ 2019-08-19 21:17               ` Josh Hunt
  2019-08-19 23:16                 ` Josh Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Hunt @ 2019-08-19 21:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt <joshhunt00@gmail.com> wrote:
>
> On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > Was there any progress made on debugging this issue? We are still
> > > > > seeing it on 4.19.44:
> > > >
> > > > I haven't seen anyone looking at this.
> > > >
> > > > Can you please try the patch Ingo posted:
> > > >
> > > >   https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/
> > > >
> > > > and if it fixes the issue decrease the value from 128 to the point where it
> > > > comes back, i.e. 128 -> 64 -> 32 ...
> > > >
> > > > Thanks,
> > > >
> > > >         tglx
> > >
> > > I just checked the machines where this problem occurs and they're both
> > > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > > Please let me know if I misread the patch or if what I'm seeing is a
> > > different issue than the one Cong originally reported.
> >
> > Find the NHM hack below.
> >
> > Thanks,
> >
> >         tglx
> >
> > 8<----------------
> >
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index 648260b5f367..93c1a4f0e73e 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left)
> >         return left;
> >  }
> >
> > +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> > +{
> > +       return max(left, 128ULL);
> > +}
> > +
> >  PMU_FORMAT_ATTR(event, "config:0-7"    );
> >  PMU_FORMAT_ATTR(umask, "config:8-15"   );
> >  PMU_FORMAT_ATTR(edge,  "config:18"     );
> > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> >                 x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
> >                 x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> >                 x86_pmu.extra_regs = intel_nehalem_extra_regs;
> > +               x86_pmu.limit_period = nhm_limit_period;
> >
> >                 mem_attr = nhm_mem_events_attrs;
> >
> Thanks Thomas. Will try this and let you know.
>
> --
> Josh

Thomas

I found on my setup that 32 was the lowest value I could use to keep
the problem from happening. Let me know if you want me to send a patch
with the updated value, etc.

I saw in the original thread from Ingo and Vince that this was seen on
Haswell, but I checked our Haswell boxes and so far we have not
reproduced the problem there.

-- 
Josh

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-19 21:17               ` Josh Hunt
@ 2019-08-19 23:16                 ` Josh Hunt
  2019-08-22 14:31                   ` Josh Hunt
  0 siblings, 1 reply; 12+ messages in thread
From: Josh Hunt @ 2019-08-19 23:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, Aug 19, 2019 at 2:17 PM Josh Hunt <joshhunt00@gmail.com> wrote:
>
> On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt <joshhunt00@gmail.com> wrote:
> >
> > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >
> > > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > > Was there any progress made on debugging this issue? We are still
> > > > > > seeing it on 4.19.44:
> > > > >
> > > > > I haven't seen anyone looking at this.
> > > > >
> > > > > Can you please try the patch Ingo posted:
> > > > >
> > > > >   https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/
> > > > >
> > > > > and if it fixes the issue decrease the value from 128 to the point where it
> > > > > comes back, i.e. 128 -> 64 -> 32 ...
> > > > >
> > > > > Thanks,
> > > > >
> > > > >         tglx
> > > >
> > > > I just checked the machines where this problem occurs and they're both
> > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > > > Please let me know if I misread the patch or if what I'm seeing is a
> > > > different issue than the one Cong originally reported.
> > >
> > > Find the NHM hack below.
> > >
> > > Thanks,
> > >
> > >         tglx
> > >
> > > 8<----------------
> > >
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index 648260b5f367..93c1a4f0e73e 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left)
> > >         return left;
> > >  }
> > >
> > > +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> > > +{
> > > +       return max(left, 128ULL);
> > > +}
> > > +
> > >  PMU_FORMAT_ATTR(event, "config:0-7"    );
> > >  PMU_FORMAT_ATTR(umask, "config:8-15"   );
> > >  PMU_FORMAT_ATTR(edge,  "config:18"     );
> > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> > >                 x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
> > >                 x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> > >                 x86_pmu.extra_regs = intel_nehalem_extra_regs;
> > > +               x86_pmu.limit_period = nhm_limit_period;
> > >
> > >                 mem_attr = nhm_mem_events_attrs;
> > >
> > Thanks Thomas. Will try this and let you know.
> >
> > --
> > Josh
>
> Thomas
>
> I found on my setup that 32 was the lowest value I could use to keep
> the problem from happening. Let me know if you want me to send a patch
> with the updated value, etc.
>
> I saw in the original thread from Ingo and Vince that this was seen on
> Haswell, but I checked our Haswell boxes and so far we have not
> reproduced the problem there.
>
> --
> Josh

I went ahead and sent this patch with the value set to 32:
https://lore.kernel.org/lkml/1566256411-18820-1-git-send-email-johunt@akamai.com/T/#u
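
In essence it is the hook above with the clamp lowered to the lowest
value that survived testing:

static u64 nhm_limit_period(struct perf_event *event, u64 left)
{
	/* 32 was the lowest clamp that avoided the stuck irq loop here. */
	return max(left, 32ULL);
}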

I wasn't sure how/who to give credit to for the change, so please
resubmit if what I did is incorrect or if you wanted to debug further.
If you decide to resubmit the patch please add my tested-by and
Bhupesh's reported-by. I'm able to reproduce the problem within about
2 hours if there's anything else you wanted to look into before going
with this approach.

Thanks!
-- 
Josh

* Re: Long standing kernel warning: perfevents: irq loop stuck!
  2019-08-19 23:16                 ` Josh Hunt
@ 2019-08-22 14:31                   ` Josh Hunt
  0 siblings, 0 replies; 12+ messages in thread
From: Josh Hunt @ 2019-08-22 14:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, Cong Wang, Liang, Kan, jolsa,
	bigeasy, H. Peter Anvin, Ingo Molnar, x86, LKML

On Mon, Aug 19, 2019 at 4:16 PM Josh Hunt <joshhunt00@gmail.com> wrote:
>
> On Mon, Aug 19, 2019 at 2:17 PM Josh Hunt <joshhunt00@gmail.com> wrote:
> >
> > On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt <joshhunt00@gmail.com> wrote:
> > >
> > > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > > >
> > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > > > Was there any progress made on debugging this issue? We are still
> > > > > > > seeing it on 4.19.44:
> > > > > >
> > > > > > I haven't seen anyone looking at this.
> > > > > >
> > > > > > Can you please try the patch Ingo posted:
> > > > > >
> > > > > >   https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/
> > > > > >
> > > > > > and if it fixes the issue decrease the value from 128 to the point where it
> > > > > > comes back, i.e. 128 -> 64 -> 32 ...
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > >         tglx
> > > > >
> > > > > I just checked the machines where this problem occurs and they're both
> > > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > > > > Please let me know if I misread the patch or if what I'm seeing is a
> > > > > different issue than the one Cong originally reported.
> > > >
> > > > Find the NHM hack below.
> > > >
> > > > Thanks,
> > > >
> > > >         tglx
> > > >
> > > > 8<----------------
> > > >
> > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > > index 648260b5f367..93c1a4f0e73e 100644
> > > > --- a/arch/x86/events/intel/core.c
> > > > +++ b/arch/x86/events/intel/core.c
> > > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left)
> > > >         return left;
> > > >  }
> > > >
> > > > +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> > > > +{
> > > > +       return max(left, 128ULL);
> > > > +}
> > > > +
> > > >  PMU_FORMAT_ATTR(event, "config:0-7"    );
> > > >  PMU_FORMAT_ATTR(umask, "config:8-15"   );
> > > >  PMU_FORMAT_ATTR(edge,  "config:18"     );
> > > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> > > >                 x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
> > > >                 x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> > > >                 x86_pmu.extra_regs = intel_nehalem_extra_regs;
> > > > +               x86_pmu.limit_period = nhm_limit_period;
> > > >
> > > >                 mem_attr = nhm_mem_events_attrs;
> > > >
> > > Thanks Thomas. Will try this and let you know.
> > >
> > > --
> > > Josh
> >
> > Thomas
> >
> > I found on my setup that 32 was the lowest value I could use to keep
> > the problem from happening. Let me know if you want me to send a patch
> > with the updated value, etc.
> >
> > I saw in the original thread from Ingo and Vince that this was seen on
> > Haswell, but I checked our Haswell boxes and so far we have not
> > reproduced the problem there.
> >
> > --
> > Josh
>
> I went ahead and sent this patch with the value set to 32:
> https://lore.kernel.org/lkml/1566256411-18820-1-git-send-email-johunt@akamai.com/T/#u
>
> I wasn't sure how/who to give credit to for the change, so please
> resubmit if what I did is incorrect or if you wanted to debug further.
> If you decide to resubmit the patch please add my tested-by and
> Bhupesh's reported-by. I'm able to reproduce the problem within about
> 2 hours if there's anything else you wanted to look into before going
> with this approach.
>
> Thanks!
> --
> Josh

Thomas

Any thoughts on the above or the patch that I sent?

Thanks!
-- 
Josh
