linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* oprofile + hibernation = badness
@ 2008-08-18 20:32 Vegard Nossum
  2008-08-18 20:51 ` Rafael J. Wysocki
  2008-08-19  1:13 ` Andi Kleen
  0 siblings, 2 replies; 17+ messages in thread
From: Vegard Nossum @ 2008-08-18 20:32 UTC (permalink / raw)
  To: Pavel Machek, Rafael J. Wysocki, Robert Richter, Ingo Molnar,
	Andi Kleen, Philippe Elie
  Cc: Linux Kernel Mailing List

Hi,

I'm probably crazy to do it, but...

    $ opcontrol --start
    $ echo disk > /sys/power/state

...leads to lots of badness/strangeness. First, lots of APIC errors:

APIC error on CPU0: 40(40)
[repeat 8 times]
APIC error on CPU1: 40(40)
APIC error on CPU1: 40(40)
APIC error on CPU0: 40(40)

These keep on coming all through the suspend/shutdown sequence, also
intermixing with other messages. I'm guessing oprofile is trying to
NMI CPUs that have been brought down?

Now I get some ACPI Exceptions, but I think that these are unrelated
to starting oprofile, because I have seen them on regular shutdowns as
well:

PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
PM: Shrinking memory... done (0 pages freed)
PM: Freed 0 kbytes in 0.44 seconds (0.00 MB/s)
Suspending console(s) (use no_console_suspend to debug)
ACPI Exception (exoparg2-0444): AE_AML_PACKAGE_LIMIT, Index
(000000007) is beyond end of object [20080609]
ACPI Error (psparse-0530): Method parse/execution failed
[\_SB_.PCI0.IDE0.GTM_] (Node f783bfc0), AE_AML_PACKAGE_LIMIT
ACPI Error (psparse-0530): Method parse/execution failed
[\_SB_.PCI0.IDE0.CHN0._GTM] (Node f783bb40), AE_AML_PACKAGE_LIMIT
ACPI handle has no context!
serial 00:0d: disabled
serial 00:06: disabled
ehci_hcd 0000:00:1d.7: PCI INT A disabled
uhci_hcd 0000:00:1d.3: PCI INT D disabled
uhci_hcd 0000:00:1d.2: PCI INT C disabled
uhci_hcd 0000:00:1d.1: PCI INT B disabled
uhci_hcd 0000:00:1d.0: PCI INT A disabled
ACPI: Preparing to enter system sleep state S4

After that, I see the message "WQ on CPU0, prefer CPU1" a few times.
This must also be bad.

Now some warnings:

------------[ cut here ]------------
WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:328 s
mp_call_function_mask+0x194/0x1a0()
Pid: 3711, comm: bash Not tainted 2.6.27-rc3-00415-g122c9e0 #10
 [<c013584f>] warn_on_slowpath+0x4f/0x80
 [<c0585ff8>] ? log_sample+0xa8/0xc0
 [<c03e4e8d>] ? device_power_down+0x5d/0x110
 [<c0586140>] ? oprofile_add_ext_sample+0x40/0xc0
 [<c05861f4>] ? oprofile_add_sample+0x34/0x50
 [<c0588f5c>] ? p4_check_ctr+0x13c/0x160
 [<c0160c54>] smp_call_function_mask+0x194/0x1a0
 [<c0587940>] ? nmi_cpu_stop+0x0/0x30
 [<c014e04c>] ? __atomic_notifier_call_chain+0x3c/0x50
 [<c014e07a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<c014e0ad>] ? notify_die+0x2d/0x30
 [<c0587940>] ? nmi_cpu_stop+0x0/0x30
 [<c0160c90>] smp_call_function+0x30/0x60
 [<c0587940>] ? nmi_cpu_stop+0x0/0x30
 [<c013a16c>] on_each_cpu+0x2c/0x80
 [<c05878d4>] nmi_stop+0x14/0x20
 [<c05878f5>] nmi_suspend+0x15/0x20
 [<c03dff63>] sysdev_suspend+0xa3/0x1f0
 [<c03e4f0c>] device_power_down+0xdc/0x110
 [<c016355b>] hibernation_snapshot+0x15b/0x220
 [<c013644b>] ? printk+0x1b/0x20
 [<c0162742>] ? freeze_processes+0x52/0xa0
 [<c0163700>] hibernate+0xe0/0x180
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c016226f>] state_store+0xbf/0xd0
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c03618d4>] kobj_attr_store+0x24/0x30
 [<c01e66c2>] sysfs_write_file+0xa2/0x100
 [<c01a85b6>] vfs_write+0x96/0x130
 [<c01e6620>] ? sysfs_write_file+0x0/0x100
 [<c01a8afd>] sys_write+0x3d/0x70
 [<c01040db>] sysenter_do_call+0x12/0x3f
 [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
 =======================
---[ end trace df36f449a70f1c8e ]---
PM: Creating hibernation image:
PM: Need to copy 54048 pages
PM: Hibernation image created (54048 pages copied)
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Intel machine check architecture supported.
BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
caller is intel_p4_mcheck_init+0xb9/0x2b0
Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
 [<c037664e>] debug_smp_processor_id+0xce/0xd0
 [<c010e919>] intel_p4_mcheck_init+0xb9/0x2b0
 [<c010dcd9>] mcheck_init+0x59/0x80
 [<c0589636>] restore_processor_state+0x1f6/0x230
 [<c01635d1>] hibernation_snapshot+0x1d1/0x220
 [<c013644b>] ? printk+0x1b/0x20
 [<c0162742>] ? freeze_processes+0x52/0xa0
 [<c0163700>] hibernate+0xe0/0x180
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c016226f>] state_store+0xbf/0xd0
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c03618d4>] kobj_attr_store+0x24/0x30
 [<c01e66c2>] sysfs_write_file+0xa2/0x100
 [<c01a85b6>] vfs_write+0x96/0x130
 [<c01e6620>] ? sysfs_write_file+0x0/0x100
 [<c01a8afd>] sys_write+0x3d/0x70
 [<c01040db>] sysenter_do_call+0x12/0x3f
 [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
 =======================
Intel machine check reporting enabled on CPU#0.
BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
caller is intel_p4_mcheck_init+0x100/0x2b0
Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
 [<c037664e>] debug_smp_processor_id+0xce/0xd0
 [<c010e960>] intel_p4_mcheck_init+0x100/0x2b0
 [<c010dcd9>] mcheck_init+0x59/0x80
 [<c0589636>] restore_processor_state+0x1f6/0x230
 [<c01635d1>] hibernation_snapshot+0x1d1/0x220
 [<c013644b>] ? printk+0x1b/0x20
 [<c0162742>] ? freeze_processes+0x52/0xa0
 [<c0163700>] hibernate+0xe0/0x180
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c016226f>] state_store+0xbf/0xd0
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c03618d4>] kobj_attr_store+0x24/0x30
 [<c01e66c2>] sysfs_write_file+0xa2/0x100
 [<c01a85b6>] vfs_write+0x96/0x130
 [<c01e6620>] ? sysfs_write_file+0x0/0x100
 [<c01a8afd>] sys_write+0x3d/0x70
 [<c01040db>] sysenter_do_call+0x12/0x3f
 [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
 =======================
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
caller is intel_p4_mcheck_init+0x119/0x2b0
Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
 [<c037664e>] debug_smp_processor_id+0xce/0xd0
 [<c010e979>] intel_p4_mcheck_init+0x119/0x2b0
 [<c010dcd9>] mcheck_init+0x59/0x80
 [<c0589636>] restore_processor_state+0x1f6/0x230
 [<c01635d1>] hibernation_snapshot+0x1d1/0x220
 [<c013644b>] ? printk+0x1b/0x20
 [<c0162742>] ? freeze_processes+0x52/0xa0
 [<c0163700>] hibernate+0xe0/0x180
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c016226f>] state_store+0xbf/0xd0
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c03618d4>] kobj_attr_store+0x24/0x30
 [<c01e66c2>] sysfs_write_file+0xa2/0x100
 [<c01a85b6>] vfs_write+0x96/0x130
 [<c01e6620>] ? sysfs_write_file+0x0/0x100
 [<c01a8afd>] sys_write+0x3d/0x70
 [<c01040db>] sysenter_do_call+0x12/0x3f
 [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
 =======================
CPU0: Thermal monitoring enabled
Force enabled HPET at resume
BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
caller is retrigger_next_event+0x15/0xb0
Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
 [<c037664e>] debug_smp_processor_id+0xce/0xd0
 [<c014c6c5>] retrigger_next_event+0x15/0xb0
 [<c066d3bf>] ? preempt_schedule+0x3f/0x50
 [<c066fe1c>] ? _spin_unlock+0x3c/0x50
 [<c014c76a>] hres_timers_resume+0xa/0x10
 [<c01509bc>] timekeeping_resume+0x12c/0x180
 [<c03df894>] __sysdev_resume+0x14/0x50
 [<c03dfdae>] sysdev_resume+0x3e/0x80
 [<c03e4f4b>] device_power_up+0xb/0x20
 [<c01635b0>] hibernation_snapshot+0x1b0/0x220
 [<c013644b>] ? printk+0x1b/0x20
 [<c0162742>] ? freeze_processes+0x52/0xa0
 [<c0163700>] hibernate+0xe0/0x180
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c016226f>] state_store+0xbf/0xd0
 [<c01621b0>] ? state_store+0x0/0xd0
 [<c03618d4>] kobj_attr_store+0x24/0x30
 [<c01e66c2>] sysfs_write_file+0xa2/0x100
 [<c01a85b6>] vfs_write+0x96/0x130
 [<c01e6620>] ? sysfs_write_file+0x0/0x100
 [<c01a8afd>] sys_write+0x3d/0x70
 [<c01040db>] sysenter_do_call+0x12/0x3f
 [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
 =======================
Enabling non-boot CPUs ...

...and then, while writing the image, it looks something like this:

APIC error on CPU1: 40(40)
^H^H^H^H  3%<7>APIC error on CPU1: 40(40)
WQ on CPU0, prefer CPU1
APIC error on CPU1: 40(40)
[8 times]
APIC error on CPU1: 40(40)
WQ on CPU0, prefer CPU1
...

The machine comes back with this spamming the logs (rate of about 2 KB/sec):

WQ on CPU0, prefer CPU1
APIC error on CPU1: 40(40)
[9 times, repeat]

I can still log in with ssh and reboot the machine, which finally
gives this warning:

BUG: sleeping function called from invalid context at kernel/sched.c:4625
in_atomic():1, irqs_disabled():0
INFO: lockdep is turned off.
Pid: 4027, comm: oprofiled Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
 [<c012c98e>] __might_sleep+0xee/0x140
 [<c066c8d8>] wait_for_common+0x28/0x130
 [<c01595bb>] ? trace_hardirqs_on+0xb/0x10
 [<c066ff13>] ? _spin_unlock_irqrestore+0x43/0x70
 [<c066ca72>] wait_for_completion+0x12/0x20
 [<c014715d>] synchronize_rcu+0x2d/0x40
 [<c01471f0>] ? wakeme_after_rcu+0x0/0x10
 [<c01595bb>] ? trace_hardirqs_on+0xb/0x10
 [<c014e156>] atomic_notifier_chain_unregister+0x36/0x50
 [<c014e17f>] unregister_die_notifier+0xf/0x20
 [<c0587b84>] nmi_shutdown+0x44/0x80
 [<c0585d4a>] oprofile_shutdown+0x2a/0x60
 [<c0586b50>] event_buffer_release+0x10/0x40
 [<c01a8f26>] __fput+0xb6/0x180
 [<c01a92a9>] fput+0x19/0x20
 [<c01a63c7>] filp_close+0x47/0x70
 [<c013741b>] put_files_struct+0x9b/0xb0
 [<c0137472>] exit_files+0x42/0x60
 [<c01386a2>] do_exit+0x192/0x880
 [<c014d3b6>] ? up_read+0x16/0x30
 [<c0120493>] ? do_page_fault+0x2e3/0x700
 [<c01a7cc0>] ? do_sync_write+0x0/0x110
 [<c0138dc1>] do_group_exit+0x31/0x90
 [<c0138e2f>] sys_exit_group+0xf/0x20
 [<c01040db>] sysenter_do_call+0x12/0x3f
 =======================
Sending all processes the KILL signal...

The last time I did this, the machine also came up with a corrupt ext3
root filesystem. But it didn't happen this time, so I guess it could
also be unrelated.

What would be the proper way to fix all this? Should oprofile be able
to profile the suspend/resume code as well? Or should "echo disk" give
-EBUSY if oprofile is running? Or
do we simply insert a huge kludge in the form of mutual exclusion
between oprofile and suspend in the Kconfig?

Thanks,


Vegard

PS: This is not a regression. It's probably always been like that.

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-18 20:32 oprofile + hibernation = badness Vegard Nossum
@ 2008-08-18 20:51 ` Rafael J. Wysocki
  2008-08-18 21:08   ` Vegard Nossum
  2008-08-19  1:13 ` Andi Kleen
  1 sibling, 1 reply; 17+ messages in thread
From: Rafael J. Wysocki @ 2008-08-18 20:51 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Pavel Machek, Robert Richter, Ingo Molnar, Andi Kleen,
	Philippe Elie, Linux Kernel Mailing List

On Monday, 18 of August 2008, Vegard Nossum wrote:
> Hi,
> 
> I'm probably crazy to do it, but...
> 
>     $ opcontrol --start
>     $ echo disk > /sys/power/state
> 
> ...leads to lots of badness/strangeness. First, lots of APIC errors:
> 
> APIC error on CPU0: 40(40)
> [repeat 8 times]
> APIC error on CPU1: 40(40)
> APIC error on CPU1: 40(40)
> APIC error on CPU0: 40(40)
> 
> These keep on coming all through the suspend/shutdown sequence, also
> intermixing with other messages. I'm guessing oprofile is trying to
> NMI CPUs that have been brought down?

They CPUs are taken down after suspending devices.  If you boot the
kernel with 'no_console_suspend' in the command line, there will be more
debug information to have a look at.

Also, you can do

# echo core > /sys/power/pm_test

before the '$ echo disk > /sys/power/state' (that only tests the suspend
sequence without actually hibernating) and see what symptoms will be
reproduced.

> Now I get some ACPI Exceptions, but I think that these are unrelated
> to starting oprofile, because I have seen them on regular shutdowns as
> well:
> 
> PM: Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.00 seconds) done.
> Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
> PM: Shrinking memory... done (0 pages freed)
> PM: Freed 0 kbytes in 0.44 seconds (0.00 MB/s)
> Suspending console(s) (use no_console_suspend to debug)
> ACPI Exception (exoparg2-0444): AE_AML_PACKAGE_LIMIT, Index
> (000000007) is beyond end of object [20080609]
> ACPI Error (psparse-0530): Method parse/execution failed
> [\_SB_.PCI0.IDE0.GTM_] (Node f783bfc0), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0530): Method parse/execution failed
> [\_SB_.PCI0.IDE0.CHN0._GTM] (Node f783bb40), AE_AML_PACKAGE_LIMIT
> ACPI handle has no context!
> serial 00:0d: disabled
> serial 00:06: disabled
> ehci_hcd 0000:00:1d.7: PCI INT A disabled
> uhci_hcd 0000:00:1d.3: PCI INT D disabled
> uhci_hcd 0000:00:1d.2: PCI INT C disabled
> uhci_hcd 0000:00:1d.1: PCI INT B disabled
> uhci_hcd 0000:00:1d.0: PCI INT A disabled
> ACPI: Preparing to enter system sleep state S4
> 
> After that, I see the message "WQ on CPU0, prefer CPU1" a few times.
> This must also be bad.
> 
> Now some warnings:
> 
> ------------[ cut here ]------------
> WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:328 s
> mp_call_function_mask+0x194/0x1a0()
> Pid: 3711, comm: bash Not tainted 2.6.27-rc3-00415-g122c9e0 #10
>  [<c013584f>] warn_on_slowpath+0x4f/0x80
>  [<c0585ff8>] ? log_sample+0xa8/0xc0
>  [<c03e4e8d>] ? device_power_down+0x5d/0x110
>  [<c0586140>] ? oprofile_add_ext_sample+0x40/0xc0
>  [<c05861f4>] ? oprofile_add_sample+0x34/0x50
>  [<c0588f5c>] ? p4_check_ctr+0x13c/0x160
>  [<c0160c54>] smp_call_function_mask+0x194/0x1a0
>  [<c0587940>] ? nmi_cpu_stop+0x0/0x30
>  [<c014e04c>] ? __atomic_notifier_call_chain+0x3c/0x50
>  [<c014e07a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<c014e0ad>] ? notify_die+0x2d/0x30
>  [<c0587940>] ? nmi_cpu_stop+0x0/0x30
>  [<c0160c90>] smp_call_function+0x30/0x60
>  [<c0587940>] ? nmi_cpu_stop+0x0/0x30
>  [<c013a16c>] on_each_cpu+0x2c/0x80
>  [<c05878d4>] nmi_stop+0x14/0x20
>  [<c05878f5>] nmi_suspend+0x15/0x20
>  [<c03dff63>] sysdev_suspend+0xa3/0x1f0
>  [<c03e4f0c>] device_power_down+0xdc/0x110
>  [<c016355b>] hibernation_snapshot+0x15b/0x220
>  [<c013644b>] ? printk+0x1b/0x20
>  [<c0162742>] ? freeze_processes+0x52/0xa0
>  [<c0163700>] hibernate+0xe0/0x180
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c016226f>] state_store+0xbf/0xd0
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c03618d4>] kobj_attr_store+0x24/0x30
>  [<c01e66c2>] sysfs_write_file+0xa2/0x100
>  [<c01a85b6>] vfs_write+0x96/0x130
>  [<c01e6620>] ? sysfs_write_file+0x0/0x100
>  [<c01a8afd>] sys_write+0x3d/0x70
>  [<c01040db>] sysenter_do_call+0x12/0x3f
>  [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
>  =======================
> ---[ end trace df36f449a70f1c8e ]---
> PM: Creating hibernation image:
> PM: Need to copy 54048 pages
> PM: Hibernation image created (54048 pages copied)
> x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
> Intel machine check architecture supported.
> BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
> caller is intel_p4_mcheck_init+0xb9/0x2b0
> Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
>  [<c037664e>] debug_smp_processor_id+0xce/0xd0
>  [<c010e919>] intel_p4_mcheck_init+0xb9/0x2b0
>  [<c010dcd9>] mcheck_init+0x59/0x80
>  [<c0589636>] restore_processor_state+0x1f6/0x230
>  [<c01635d1>] hibernation_snapshot+0x1d1/0x220
>  [<c013644b>] ? printk+0x1b/0x20
>  [<c0162742>] ? freeze_processes+0x52/0xa0
>  [<c0163700>] hibernate+0xe0/0x180
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c016226f>] state_store+0xbf/0xd0
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c03618d4>] kobj_attr_store+0x24/0x30
>  [<c01e66c2>] sysfs_write_file+0xa2/0x100
>  [<c01a85b6>] vfs_write+0x96/0x130
>  [<c01e6620>] ? sysfs_write_file+0x0/0x100
>  [<c01a8afd>] sys_write+0x3d/0x70
>  [<c01040db>] sysenter_do_call+0x12/0x3f
>  [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
>  =======================
> Intel machine check reporting enabled on CPU#0.
> BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
> caller is intel_p4_mcheck_init+0x100/0x2b0
> Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
>  [<c037664e>] debug_smp_processor_id+0xce/0xd0
>  [<c010e960>] intel_p4_mcheck_init+0x100/0x2b0
>  [<c010dcd9>] mcheck_init+0x59/0x80
>  [<c0589636>] restore_processor_state+0x1f6/0x230
>  [<c01635d1>] hibernation_snapshot+0x1d1/0x220
>  [<c013644b>] ? printk+0x1b/0x20
>  [<c0162742>] ? freeze_processes+0x52/0xa0
>  [<c0163700>] hibernate+0xe0/0x180
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c016226f>] state_store+0xbf/0xd0
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c03618d4>] kobj_attr_store+0x24/0x30
>  [<c01e66c2>] sysfs_write_file+0xa2/0x100
>  [<c01a85b6>] vfs_write+0x96/0x130
>  [<c01e6620>] ? sysfs_write_file+0x0/0x100
>  [<c01a8afd>] sys_write+0x3d/0x70
>  [<c01040db>] sysenter_do_call+0x12/0x3f
>  [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
>  =======================
> CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
> BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
> caller is intel_p4_mcheck_init+0x119/0x2b0
> Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
>  [<c037664e>] debug_smp_processor_id+0xce/0xd0
>  [<c010e979>] intel_p4_mcheck_init+0x119/0x2b0
>  [<c010dcd9>] mcheck_init+0x59/0x80
>  [<c0589636>] restore_processor_state+0x1f6/0x230
>  [<c01635d1>] hibernation_snapshot+0x1d1/0x220
>  [<c013644b>] ? printk+0x1b/0x20
>  [<c0162742>] ? freeze_processes+0x52/0xa0
>  [<c0163700>] hibernate+0xe0/0x180
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c016226f>] state_store+0xbf/0xd0
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c03618d4>] kobj_attr_store+0x24/0x30
>  [<c01e66c2>] sysfs_write_file+0xa2/0x100
>  [<c01a85b6>] vfs_write+0x96/0x130
>  [<c01e6620>] ? sysfs_write_file+0x0/0x100
>  [<c01a8afd>] sys_write+0x3d/0x70
>  [<c01040db>] sysenter_do_call+0x12/0x3f
>  [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
>  =======================
> CPU0: Thermal monitoring enabled
> Force enabled HPET at resume
> BUG: using smp_processor_id() in preemptible [00000000] code: bash/3711
> caller is retrigger_next_event+0x15/0xb0
> Pid: 3711, comm: bash Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
>  [<c037664e>] debug_smp_processor_id+0xce/0xd0
>  [<c014c6c5>] retrigger_next_event+0x15/0xb0
>  [<c066d3bf>] ? preempt_schedule+0x3f/0x50
>  [<c066fe1c>] ? _spin_unlock+0x3c/0x50
>  [<c014c76a>] hres_timers_resume+0xa/0x10
>  [<c01509bc>] timekeeping_resume+0x12c/0x180
>  [<c03df894>] __sysdev_resume+0x14/0x50
>  [<c03dfdae>] sysdev_resume+0x3e/0x80
>  [<c03e4f4b>] device_power_up+0xb/0x20
>  [<c01635b0>] hibernation_snapshot+0x1b0/0x220
>  [<c013644b>] ? printk+0x1b/0x20
>  [<c0162742>] ? freeze_processes+0x52/0xa0
>  [<c0163700>] hibernate+0xe0/0x180
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c016226f>] state_store+0xbf/0xd0
>  [<c01621b0>] ? state_store+0x0/0xd0
>  [<c03618d4>] kobj_attr_store+0x24/0x30
>  [<c01e66c2>] sysfs_write_file+0xa2/0x100
>  [<c01a85b6>] vfs_write+0x96/0x130
>  [<c01e6620>] ? sysfs_write_file+0x0/0x100
>  [<c01a8afd>] sys_write+0x3d/0x70
>  [<c01040db>] sysenter_do_call+0x12/0x3f
>  [<c0670000>] ? _read_unlock_irqrestore+0x60/0x70
>  =======================
> Enabling non-boot CPUs ...
> 
> ...and then, while writing the image, it looks something like this:
> 
> APIC error on CPU1: 40(40)
> ^H^H^H^H  3%<7>APIC error on CPU1: 40(40)
> WQ on CPU0, prefer CPU1
> APIC error on CPU1: 40(40)
> [8 times]
> APIC error on CPU1: 40(40)
> WQ on CPU0, prefer CPU1
> ...
> 
> The machine comes back with this spamming the logs (rate of about 2 KB/sec):
> 
> WQ on CPU0, prefer CPU1
> APIC error on CPU1: 40(40)
> [9 times, repeat]
> 
> I can still log in with ssh and reboot the machine, which finally
> gives this warning:
> 
> BUG: sleeping function called from invalid context at kernel/sched.c:4625
> in_atomic():1, irqs_disabled():0
> INFO: lockdep is turned off.
> Pid: 4027, comm: oprofiled Tainted: G        W 2.6.27-rc3-00415-g122c9e0 #10
>  [<c012c98e>] __might_sleep+0xee/0x140
>  [<c066c8d8>] wait_for_common+0x28/0x130
>  [<c01595bb>] ? trace_hardirqs_on+0xb/0x10
>  [<c066ff13>] ? _spin_unlock_irqrestore+0x43/0x70
>  [<c066ca72>] wait_for_completion+0x12/0x20
>  [<c014715d>] synchronize_rcu+0x2d/0x40
>  [<c01471f0>] ? wakeme_after_rcu+0x0/0x10
>  [<c01595bb>] ? trace_hardirqs_on+0xb/0x10
>  [<c014e156>] atomic_notifier_chain_unregister+0x36/0x50
>  [<c014e17f>] unregister_die_notifier+0xf/0x20
>  [<c0587b84>] nmi_shutdown+0x44/0x80
>  [<c0585d4a>] oprofile_shutdown+0x2a/0x60
>  [<c0586b50>] event_buffer_release+0x10/0x40
>  [<c01a8f26>] __fput+0xb6/0x180
>  [<c01a92a9>] fput+0x19/0x20
>  [<c01a63c7>] filp_close+0x47/0x70
>  [<c013741b>] put_files_struct+0x9b/0xb0
>  [<c0137472>] exit_files+0x42/0x60
>  [<c01386a2>] do_exit+0x192/0x880
>  [<c014d3b6>] ? up_read+0x16/0x30
>  [<c0120493>] ? do_page_fault+0x2e3/0x700
>  [<c01a7cc0>] ? do_sync_write+0x0/0x110
>  [<c0138dc1>] do_group_exit+0x31/0x90
>  [<c0138e2f>] sys_exit_group+0xf/0x20
>  [<c01040db>] sysenter_do_call+0x12/0x3f
>  =======================
> Sending all processes the KILL signal...
> 
> The last time I did this, the machine also came up with a corrupt ext3
> root filesystem. But it didn't happen this time, so I guess it could
> also be unrelated.
> 
> What would be the proper way to fix all this? Should oprofile be able
> to profile the suspend/resume code as well? Or should "echo disk" give
> -EBUSY if oprofile is running? Or
> do we simply insert a huge kludge in the form of mutual exclusion
> between oprofile and suspend in the Kconfig?

That last thing is probably the safest to do at the moment.

Apparently nmi_suspend() conflicts with oprofile somehow.  Also, the offlining
of non-boot CPUs may confuse it.  It would be helpful to check if the CPU
hotplug works with oprofile.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-18 20:51 ` Rafael J. Wysocki
@ 2008-08-18 21:08   ` Vegard Nossum
  2008-08-18 21:15     ` Rafael J. Wysocki
  0 siblings, 1 reply; 17+ messages in thread
From: Vegard Nossum @ 2008-08-18 21:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Robert Richter, Ingo Molnar, Andi Kleen,
	Philippe Elie, Linux Kernel Mailing List

On Mon, Aug 18, 2008 at 10:51 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> Apparently nmi_suspend() conflicts with oprofile somehow.  Also, the offlining
> of non-boot CPUs may confuse it.  It would be helpful to check if the CPU
> hotplug works with oprofile.

That is a good suggestion :-)

Here is offlining:

CPU 1 is now offline
lockdep: fixing up alternatives.
SMP alternatives: switching to UP code
CPU0 attaching NULL sched-domain.
WQ on CPU0, prefer CPU1
CPU1 attaching NULL sched-domain.
CPU0 attaching sched-domain:
 domain 0: span 0 level CPU
  groups: 0
WQ on CPU0, prefer CPU1
WQ on CPU0, prefer CPU1
WQ on CPU0, prefer CPU1
[repeat last message indefinitely]

Here is onlining:

Booting processor 1/1 ip 6000
Initializing CPU#1
WQ on CPU0, prefer CPU1
WQ on CPU0, prefer CPU1
Calibrating delay using timer specific routine.. 5986.15 BogoMIPS (lpj=29930790)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
CPU1: Thermal monitoring enabled
x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
CPU1: Intel(R) Pentium(R) 4 CPU 3.00GHz stepping 05
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to check_tsc_sync_source failed
APIC error on CPU1: 00(40)
Clockevents: could not switch to one-shot mode:<7>APIC error on CPU1: 40(40)
 lapic is not functional.
Could not switch to high resolution mode on CPU 0
Clockevents: could not switch to one-shot mode: lapic is not functional.
Could not switch to high resolution mode on CPU 1
APIC error on CPU1: 40(40)
[sched domains messages
WQ on CPU0, prefer CPU1
APIC error on CPU1: 40(40)
[repeat last message 9 times]

Then follows this pattern indefinitely:

WQ on CPU0, prefer CPU1
APIC error on CPU1: 40(40)
[repeat last message 9 times]

That's basically the same thing as I saw with suspend. So it can be
reproduced easily with CPU hotplug.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-18 21:08   ` Vegard Nossum
@ 2008-08-18 21:15     ` Rafael J. Wysocki
  2008-08-18 21:29       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Rafael J. Wysocki @ 2008-08-18 21:15 UTC (permalink / raw)
  To: Vegard Nossum, Andrew Morton
  Cc: Pavel Machek, Robert Richter, Ingo Molnar, Andi Kleen,
	Philippe Elie, Linux Kernel Mailing List

On Monday, 18 of August 2008, Vegard Nossum wrote:
> On Mon, Aug 18, 2008 at 10:51 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > Apparently nmi_suspend() conflicts with oprofile somehow.  Also, the offlining
> > of non-boot CPUs may confuse it.  It would be helpful to check if the CPU
> > hotplug works with oprofile.
> 
> That is a good suggestion :-)
> 
> Here is offlining:
> 
> CPU 1 is now offline
> lockdep: fixing up alternatives.
> SMP alternatives: switching to UP code
> CPU0 attaching NULL sched-domain.
> WQ on CPU0, prefer CPU1
> CPU1 attaching NULL sched-domain.
> CPU0 attaching sched-domain:
>  domain 0: span 0 level CPU
>   groups: 0
> WQ on CPU0, prefer CPU1
> WQ on CPU0, prefer CPU1
> WQ on CPU0, prefer CPU1
> [repeat last message indefinitely]
> 
> Here is onlining:
> 
> Booting processor 1/1 ip 6000
> Initializing CPU#1
> WQ on CPU0, prefer CPU1
> WQ on CPU0, prefer CPU1
> Calibrating delay using timer specific routine.. 5986.15 BogoMIPS (lpj=29930790)
> CPU: Trace cache: 12K uops, L1 D cache: 16K
> CPU: L2 cache: 2048K
> CPU: Physical Processor ID: 0
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#1.
> CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
> CPU1: Thermal monitoring enabled
> x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
> CPU1: Intel(R) Pentium(R) 4 CPU 3.00GHz stepping 05
> checking TSC synchronization [CPU#0 -> CPU#1]:
> Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
> Marking TSC unstable due to check_tsc_sync_source failed
> APIC error on CPU1: 00(40)
> Clockevents: could not switch to one-shot mode:<7>APIC error on CPU1: 40(40)
>  lapic is not functional.
> Could not switch to high resolution mode on CPU 0
> Clockevents: could not switch to one-shot mode: lapic is not functional.
> Could not switch to high resolution mode on CPU 1
> APIC error on CPU1: 40(40)
> [sched domains messages
> WQ on CPU0, prefer CPU1
> APIC error on CPU1: 40(40)
> [repeat last message 9 times]
> 
> Then follows this pattern indefinitely:
> 
> WQ on CPU0, prefer CPU1
> APIC error on CPU1: 40(40)
> [repeat last message 9 times]
> 
> That's basically the same thing as I saw with suspend. So it can be
> reproduced easily with CPU hotplug.

Well, I don't know who's the right person to ask about the CPU hotplug.
Andrew, can you help please?

Rafael

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-18 21:15     ` Rafael J. Wysocki
@ 2008-08-18 21:29       ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2008-08-18 21:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: vegard.nossum, pavel, robert.richter, mingo, ak, phil.el,
	linux-kernel, Thomas Gleixner

On Mon, 18 Aug 2008 23:15:59 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> On Monday, 18 of August 2008, Vegard Nossum wrote:
> > On Mon, Aug 18, 2008 at 10:51 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > > Apparently nmi_suspend() conflicts with oprofile somehow.  Also, the offlining
> > > of non-boot CPUs may confuse it.  It would be helpful to check if the CPU
> > > hotplug works with oprofile.
> > 
> > That is a good suggestion :-)
> > 
> > Here is offlining:
> > 
> > CPU 1 is now offline
> > lockdep: fixing up alternatives.
> > SMP alternatives: switching to UP code
> > CPU0 attaching NULL sched-domain.
> > WQ on CPU0, prefer CPU1
> > CPU1 attaching NULL sched-domain.
> > CPU0 attaching sched-domain:
> >  domain 0: span 0 level CPU
> >   groups: 0
> > WQ on CPU0, prefer CPU1
> > WQ on CPU0, prefer CPU1
> > WQ on CPU0, prefer CPU1
> > [repeat last message indefinitely]
> > 
> > Here is onlining:
> > 
> > Booting processor 1/1 ip 6000
> > Initializing CPU#1
> > WQ on CPU0, prefer CPU1
> > WQ on CPU0, prefer CPU1
> > Calibrating delay using timer specific routine.. 5986.15 BogoMIPS (lpj=29930790)
> > CPU: Trace cache: 12K uops, L1 D cache: 16K
> > CPU: L2 cache: 2048K
> > CPU: Physical Processor ID: 0
> > Intel machine check architecture supported.
> > Intel machine check reporting enabled on CPU#1.
> > CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
> > CPU1: Thermal monitoring enabled
> > x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
> > CPU1: Intel(R) Pentium(R) 4 CPU 3.00GHz stepping 05
> > checking TSC synchronization [CPU#0 -> CPU#1]:
> > Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
> > Marking TSC unstable due to check_tsc_sync_source failed
> > APIC error on CPU1: 00(40)
> > Clockevents: could not switch to one-shot mode:<7>APIC error on CPU1: 40(40)
> >  lapic is not functional.
> > Could not switch to high resolution mode on CPU 0
> > Clockevents: could not switch to one-shot mode: lapic is not functional.
> > Could not switch to high resolution mode on CPU 1
> > APIC error on CPU1: 40(40)
> > [sched domains messages
> > WQ on CPU0, prefer CPU1
> > APIC error on CPU1: 40(40)
> > [repeat last message 9 times]
> > 
> > Then follows this pattern indefinitely:
> > 
> > WQ on CPU0, prefer CPU1
> > APIC error on CPU1: 40(40)
> > [repeat last message 9 times]
> > 
> > That's basically the same thing as I saw with suspend. So it can be
> > reproduced easily with CPU hotplug.
> 
> Well, I don't know who's the right person to ask about the CPU hotplug.
> Andrew, can you help please?
> 

The CPU hotplug maintainer is basically "everyone", because many
subsystems need to interact correctly with hotplugging, and many
subsystems can break it.

This one looks like a clocksource/apic/resume problem?  Probably the
next port of call would be Thomas, with Robert looking on.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-18 20:32 oprofile + hibernation = badness Vegard Nossum
  2008-08-18 20:51 ` Rafael J. Wysocki
@ 2008-08-19  1:13 ` Andi Kleen
  2008-08-19  7:12   ` Vegard Nossum
  2008-09-01 16:34   ` Robert Richter
  1 sibling, 2 replies; 17+ messages in thread
From: Andi Kleen @ 2008-08-19  1:13 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Pavel Machek, Rafael J. Wysocki, Robert Richter, Ingo Molnar,
	Philippe Elie, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2284 bytes --]


> 
> These keep on coming all through the suspend/shutdown sequence, also
> intermixing with other messages. I'm guessing oprofile is trying to
> NMI CPUs that have been brought down?

That should actually work in theory. Linux CPU offlining puts CPUs in a state where
they can still process NMIs. Hmm actually there was a change recently to free
their exception stacks. Maybe it's broken now.

Also oprofile has a suspend method that disables NMIs.

> Now I get some ACPI Exceptions, but I think that these are unrelated
> to starting oprofile, because I have seen them on regular shutdowns as
> well:
> 
> PM: Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.00 seconds) done.
> Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
> PM: Shrinking memory... done (0 pages freed)
> PM: Freed 0 kbytes in 0.44 seconds (0.00 MB/s)
> Suspending console(s) (use no_console_suspend to debug)
> ACPI Exception (exoparg2-0444): AE_AML_PACKAGE_LIMIT, Index
> (000000007) is beyond end of object [20080609]
> ACPI Error (psparse-0530): Method parse/execution failed
> [\_SB_.PCI0.IDE0.GTM_] (Node f783bfc0), AE_AML_PACKAGE_LIMIT
> ACPI Error (psparse-0530): Method parse/execution failed
> [\_SB_.PCI0.IDE0.CHN0._GTM] (Node f783bb40), AE_AML_PACKAGE_LIMIT
> ACPI handle has no context!
> serial 00:0d: disabled
> serial 00:06: disabled
> ehci_hcd 0000:00:1d.7: PCI INT A disabled
> uhci_hcd 0000:00:1d.3: PCI INT D disabled
> uhci_hcd 0000:00:1d.2: PCI INT C disabled
> uhci_hcd 0000:00:1d.1: PCI INT B disabled
> uhci_hcd 0000:00:1d.0: PCI INT A disabled
> ACPI: Preparing to enter system sleep state S4
> 
> After that, I see the message "WQ on CPU0, prefer CPU1" a few times.
> This must also be bad.
> 
> Now some warnings:
> 
> ------------[ cut here ]------------
> WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:328 s
> mp_call_function_mask+0x194/0x1a0()

The usual problem: the suspend function when interrupts are
already disabled calls smp_call_function which is not allowed with
interrupt off. But at this point all the other CPUs should be already
down anyways, so it should be enough to just drop that.

This patch should fix that problem at least by fixing cpu hotplug&
suspend support. Untested.

-Andi


[-- Attachment #2: 0001-oprofile-fix-cpu-hotcplug.patch --]
[-- Type: text/plain, Size: 2590 bytes --]

From 3967a8b03a30f26a6a5f41b4cc6a6772b5d0ce4f Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Tue, 19 Aug 2008 03:10:03 +0200
Subject: [PATCH] oprofile: fix cpu hotcplug

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/oprofile/nmi_int.c |   40 +++++++++++++++++++++++++++++++++++++---
 1 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/oprofile/nmi_int.c b/arch/x86/oprofile/nmi_int.c
index 529d625..6438c32 100644
--- a/arch/x86/oprofile/nmi_int.c
+++ b/arch/x86/oprofile/nmi_int.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/moduleparam.h>
 #include <linux/kdebug.h>
+#include <linux/cpu.h>
 #include <asm/nmi.h>
 #include <asm/msr.h>
 #include <asm/apic.h>
@@ -28,23 +29,48 @@ static DEFINE_PER_CPU(unsigned long, saved_lvtpc);
 
 static int nmi_start(void);
 static void nmi_stop(void);
+static void nmi_cpu_start(void *dummy);
+static void nmi_cpu_stop(void *dummy);
 
 /* 0 == registered but off, 1 == registered and on */
 static int nmi_enabled = 0;
 
+#ifdef CONFIG_SMP
+static int oprofile_cpu_notifier(struct notifier_block *b, unsigned long action,
+				 void *data)
+{
+	int cpu = (unsigned long)data;
+	switch (action) { 
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+		smp_call_function_single(cpu, nmi_cpu_start, NULL, 0);
+		break;
+	case CPU_DOWN_PREPARE:
+		smp_call_function_single(cpu, nmi_cpu_stop, NULL, 1);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block oprofile_cpu_nb = { 
+	.notifier_call = oprofile_cpu_notifier
+};	
+#endif
+
 #ifdef CONFIG_PM
 
 static int nmi_suspend(struct sys_device *dev, pm_message_t state)
 {
+	/* Only one CPU left, just stop that one */
 	if (nmi_enabled == 1)
-		nmi_stop();
+		nmi_cpu_stop(NULL);
 	return 0;
 }
 
 static int nmi_resume(struct sys_device *dev)
 {
 	if (nmi_enabled == 1)
-		nmi_start();
+		nmi_cpu_start(NULL);
 	return 0;
 }
 
@@ -66,6 +92,7 @@ static int __init init_sysfs(void)
 	error = sysdev_class_register(&oprofile_sysclass);
 	if (!error)
 		error = sysdev_register(&device_oprofile);
+	
 	return error;
 }
 
@@ -483,6 +510,9 @@ int __init op_nmi_init(struct oprofile_operations *ops)
 	}
 
 	init_sysfs();
+#ifdef CONFIG_SMP
+	register_cpu_notifier(&oprofile_cpu_nb);
+#endif
 	using_nmi = 1;
 	ops->create_files = nmi_create_files;
 	ops->setup = nmi_setup;
@@ -496,6 +526,10 @@ int __init op_nmi_init(struct oprofile_operations *ops)
 
 void op_nmi_exit(void)
 {
-	if (using_nmi)
+	if (using_nmi) { 
 		exit_sysfs();
+#ifdef CONFIG_SMP
+		unregister_cpu_notifier(&oprofile_cpu_nb);
+#endif
+	}
 }
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19  1:13 ` Andi Kleen
@ 2008-08-19  7:12   ` Vegard Nossum
  2008-08-19  9:49     ` Ingo Molnar
  2008-09-01 16:34   ` Robert Richter
  1 sibling, 1 reply; 17+ messages in thread
From: Vegard Nossum @ 2008-08-19  7:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pavel Machek, Rafael J. Wysocki, Robert Richter, Ingo Molnar,
	Philippe Elie, Linux Kernel Mailing List

On Tue, Aug 19, 2008 at 3:13 AM, Andi Kleen <ak@linux.intel.com> wrote:
>> ------------[ cut here ]------------
>> WARNING: at
>> /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:328 s
>> mp_call_function_mask+0x194/0x1a0()
>
> The usual problem: the suspend function when interrupts are
> already disabled calls smp_call_function which is not allowed with
> interrupt off. But at this point all the other CPUs should be already
> down anyways, so it should be enough to just drop that.
>
> This patch should fix that problem at least by fixing cpu hotplug&
> suspend support. Untested.
>
> -Andi
>
>
> From 3967a8b03a30f26a6a5f41b4cc6a6772b5d0ce4f Mon Sep 17 00:00:00 2001
> From: Andi Kleen <ak@linux.intel.com>
> Date: Tue, 19 Aug 2008 03:10:03 +0200
> Subject: [PATCH] oprofile: fix cpu hotcplug
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

This gets rid of the suspend warnings, so it looks to be a step in the
right direction! FWIW, you may add my Tested-by to the patch.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19  7:12   ` Vegard Nossum
@ 2008-08-19  9:49     ` Ingo Molnar
  2008-08-19 12:12       ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-08-19  9:49 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Andi Kleen, Pavel Machek, Rafael J. Wysocki, Robert Richter,
	Philippe Elie, Linux Kernel Mailing List


* Vegard Nossum <vegard.nossum@gmail.com> wrote:

> This gets rid of the suspend warnings, so it looks to be a step in the 
> right direction! FWIW, you may add my Tested-by to the patch.

i've queued it up in tip/x86/oprofile - thanks guys.

Andi, another issue: your patch had 5 spurious whitespace errors. This 
is a reoccuring, many years pattern of behavior from you and you need to 
fix your workflow to send less sloppy patches. You should check your 
editor's settings (and generally you should take a good look at patches 
you send out).

I fixed up the problems of this patch, no need to resend.

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19  9:49     ` Ingo Molnar
@ 2008-08-19 12:12       ` Andi Kleen
  2008-08-19 12:37         ` Robert Richter
                           ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Andi Kleen @ 2008-08-19 12:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Robert Richter,
	Philippe Elie, Linux Kernel Mailing List

Ingo Molnar wrote:
> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
> 
>> This gets rid of the suspend warnings, so it looks to be a step in the 
>> right direction! FWIW, you may add my Tested-by to the patch.
> 
> i've queued it up in tip/x86/oprofile - thanks guys.

I would have thought Robert would take them? He's oprofile
maintainer after all.

> 
> Andi, another issue: your patch had 5 spurious whitespace errors. This 
> is a reoccuring, many years pattern of behavior from you and you need to 
> fix your workflow to send less sloppy patches. 

Yes I forgot you define patch cleanness based on white space.

Seriously, It's actually new -- i did this one in git directly instead of
quilt and git-commit doesn't seem to know how to drop them. In my older quilt
workflow they were always automatically dropped since many years. Both emacs
and vim add them and I don't plan to switch editors, sorry.

> I fixed up the problems of this patch, no need to resend.

I hope you fixed the typos in the commit log too (it really was just
a RFC test patch)

-Andi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19 12:12       ` Andi Kleen
@ 2008-08-19 12:37         ` Robert Richter
  2008-08-19 12:56         ` Johannes Weiner
  2008-08-19 13:18         ` Ingo Molnar
  2 siblings, 0 replies; 17+ messages in thread
From: Robert Richter @ 2008-08-19 12:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Vegard Nossum, Pavel Machek, Rafael J. Wysocki,
	Philippe Elie, Linux Kernel Mailing List

On 19.08.08 14:12:38, Andi Kleen wrote:
> Ingo Molnar wrote:
>> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
>>> This gets rid of the suspend warnings, so it looks to be a step in the 
>>> right direction! FWIW, you may add my Tested-by to the patch.
>> i've queued it up in tip/x86/oprofile - thanks guys.
>
> I would have thought Robert would take them? He's oprofile
> maintainer after all.

Yes, as soon as I have my own repository available this will change.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19 12:12       ` Andi Kleen
  2008-08-19 12:37         ` Robert Richter
@ 2008-08-19 12:56         ` Johannes Weiner
  2008-08-19 13:18           ` Andi Kleen
  2008-08-19 13:18         ` Ingo Molnar
  2 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2008-08-19 12:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Vegard Nossum, Pavel Machek, Rafael J. Wysocki,
	Robert Richter, Philippe Elie, Linux Kernel Mailing List

Andi Kleen <ak@linux.intel.com> writes:

> Ingo Molnar wrote:
>> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
>>
>>> This gets rid of the suspend warnings, so it looks to be a step in
>>> the right direction! FWIW, you may add my Tested-by to the patch.
>>
>> i've queued it up in tip/x86/oprofile - thanks guys.
>
> I would have thought Robert would take them? He's oprofile
> maintainer after all.
>
>>
>> Andi, another issue: your patch had 5 spurious whitespace
>> errors. This is a reoccuring, many years pattern of behavior from
>> you and you need to fix your workflow to send less sloppy patches. 
>
> Yes I forgot you define patch cleanness based on white space.
>
> Seriously, It's actually new -- i did this one in git directly instead of
> quilt and git-commit doesn't seem to know how to drop them. In my older quilt
> workflow they were always automatically dropped since many years. Both emacs
> and vim add them and I don't plan to switch editors, sorry.

You probably misconfigured emacs then.  I ran checkpatch over your patch
and don't see how and why emacs would add trailing whitespace after a
opening brace.

You can use this to be on the safe side:

(add-hook 'before-save-hook 'delete-trailing-whitespace)

	Hannes

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19 12:56         ` Johannes Weiner
@ 2008-08-19 13:18           ` Andi Kleen
  0 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2008-08-19 13:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Robert Richter,
	Philippe Elie, Linux Kernel Mailing List

Johannes Weiner <hannes@saeurebad.de> writes:
>
> You probably misconfigured emacs then.  I ran checkpatch over your patch
> and don't see how and why emacs would add trailing whitespace after a
> opening brace.

I don't know how, but it does. vim does too sometimes (I use both
editors) 

>
> You can use this to be on the safe side:
>
> (add-hook 'before-save-hook 'delete-trailing-whitespace)

That drops it on all. This means when I edit something which already
has trailing spaces then the diff is full of noise. That would
be a cure far worse than the disease[1]

Ideal would be if git-commit dropped them on all new lines, similar to 
git-apply --whitespace=fix and quilt refresh, but it doesn't seem to know how.

But normally maintainers just set these options (at least I do and Andrew
does too) so it doesn't really matter if the submitted patch has trailing 
spaces or not because they disappear on the way up. 

And that said I refuse to write any more on this silly bike shed topic.

-Andi

[1] to be honest I'm not sure i are actually a disease since I don't
know of any single problem they cause.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19 12:12       ` Andi Kleen
  2008-08-19 12:37         ` Robert Richter
  2008-08-19 12:56         ` Johannes Weiner
@ 2008-08-19 13:18         ` Ingo Molnar
  2 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2008-08-19 13:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Robert Richter,
	Philippe Elie, Linux Kernel Mailing List


* Andi Kleen <ak@linux.intel.com> wrote:

>> Andi, another issue: your patch had 5 spurious whitespace errors. 
>> This is a reoccuring, many years pattern of behavior from you and you 
>> need to fix your workflow to send less sloppy patches.
>
> Yes I forgot you define patch cleanness based on white space.
>
> Seriously, It's actually new -- i did this one in git directly instead 
> of quilt and git-commit doesn't seem to know how to drop them. In my 
> older quilt workflow they were always automatically dropped since many 
> years. Both emacs and vim add them and I don't plan to switch editors, 
> sorry.

Thousands of other kernel developers get it right and manage to submit 
clean patches, so i think you are not trying very hard to actually solve 
this problem - and your flippant and obtrusive answers do not make it 
very easy to work with you as a contributor.

>> I fixed up the problems of this patch, no need to resend.
>
> I hope you fixed the typos in the commit log too (it really was just a 
> RFC test patch)

You've put your signoff in there and Vegard has tested it. The general 
principle on lkml is that patches can be marked as do-not-apply-yet by 
adding something like this:

 Not-Signed-off-by: Andi Kleen <ak@linux.intel.com>

or by not adding a SOB line at all. In that case i either ask people for 
the SOB or apply them at my own risk.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-08-19  1:13 ` Andi Kleen
  2008-08-19  7:12   ` Vegard Nossum
@ 2008-09-01 16:34   ` Robert Richter
  2008-09-05 17:58     ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Robert Richter @ 2008-09-01 16:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Philippe Elie,
	Linux Kernel Mailing List, Andi Kleen

Ingo,

this fix (commit id 1404e403) is not yet upstream for v2.6.27. Could
this be added to the next pull request?

Thanks,

-Robert

On 19.08.08 03:13:38, Andi Kleen wrote:
>
>> These keep on coming all through the suspend/shutdown sequence, also
>> intermixing with other messages. I'm guessing oprofile is trying to
>> NMI CPUs that have been brought down?
>
> That should actually work in theory. Linux CPU offlining puts CPUs in a 
> state where
> they can still process NMIs. Hmm actually there was a change recently to 
> free
> their exception stacks. Maybe it's broken now.
>
> Also oprofile has a suspend method that disables NMIs.
>
>> Now I get some ACPI Exceptions, but I think that these are unrelated
>> to starting oprofile, because I have seen them on regular shutdowns as
>> well:
>> PM: Syncing filesystems ... done.
>> Freezing user space processes ... (elapsed 0.00 seconds) done.
>> Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
>> PM: Shrinking memory... done (0 pages freed)
>> PM: Freed 0 kbytes in 0.44 seconds (0.00 MB/s)
>> Suspending console(s) (use no_console_suspend to debug)
>> ACPI Exception (exoparg2-0444): AE_AML_PACKAGE_LIMIT, Index
>> (000000007) is beyond end of object [20080609]
>> ACPI Error (psparse-0530): Method parse/execution failed
>> [\_SB_.PCI0.IDE0.GTM_] (Node f783bfc0), AE_AML_PACKAGE_LIMIT
>> ACPI Error (psparse-0530): Method parse/execution failed
>> [\_SB_.PCI0.IDE0.CHN0._GTM] (Node f783bb40), AE_AML_PACKAGE_LIMIT
>> ACPI handle has no context!
>> serial 00:0d: disabled
>> serial 00:06: disabled
>> ehci_hcd 0000:00:1d.7: PCI INT A disabled
>> uhci_hcd 0000:00:1d.3: PCI INT D disabled
>> uhci_hcd 0000:00:1d.2: PCI INT C disabled
>> uhci_hcd 0000:00:1d.1: PCI INT B disabled
>> uhci_hcd 0000:00:1d.0: PCI INT A disabled
>> ACPI: Preparing to enter system sleep state S4
>> After that, I see the message "WQ on CPU0, prefer CPU1" a few times.
>> This must also be bad.
>> Now some warnings:
>> ------------[ cut here ]------------
>> WARNING: at 
>> /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:328 s
>> mp_call_function_mask+0x194/0x1a0()
>
> The usual problem: the suspend function when interrupts are
> already disabled calls smp_call_function which is not allowed with
> interrupt off. But at this point all the other CPUs should be already
> down anyways, so it should be enough to just drop that.
>
> This patch should fix that problem at least by fixing cpu hotplug&
> suspend support. Untested.
>
> -Andi
>

> From 3967a8b03a30f26a6a5f41b4cc6a6772b5d0ce4f Mon Sep 17 00:00:00 2001
> From: Andi Kleen <ak@linux.intel.com>
> Date: Tue, 19 Aug 2008 03:10:03 +0200
> Subject: [PATCH] oprofile: fix cpu hotcplug
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/oprofile/nmi_int.c |   40 +++++++++++++++++++++++++++++++++++++---
>  1 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/oprofile/nmi_int.c b/arch/x86/oprofile/nmi_int.c
> index 529d625..6438c32 100644
> --- a/arch/x86/oprofile/nmi_int.c
> +++ b/arch/x86/oprofile/nmi_int.c
> @@ -15,6 +15,7 @@
>  #include <linux/slab.h>
>  #include <linux/moduleparam.h>
>  #include <linux/kdebug.h>
> +#include <linux/cpu.h>
>  #include <asm/nmi.h>
>  #include <asm/msr.h>
>  #include <asm/apic.h>
> @@ -28,23 +29,48 @@ static DEFINE_PER_CPU(unsigned long, saved_lvtpc);
>  
>  static int nmi_start(void);
>  static void nmi_stop(void);
> +static void nmi_cpu_start(void *dummy);
> +static void nmi_cpu_stop(void *dummy);
>  
>  /* 0 == registered but off, 1 == registered and on */
>  static int nmi_enabled = 0;
>  
> +#ifdef CONFIG_SMP
> +static int oprofile_cpu_notifier(struct notifier_block *b, unsigned long action,
> +				 void *data)
> +{
> +	int cpu = (unsigned long)data;
> +	switch (action) { 
> +	case CPU_DOWN_FAILED:
> +	case CPU_ONLINE:
> +		smp_call_function_single(cpu, nmi_cpu_start, NULL, 0);
> +		break;
> +	case CPU_DOWN_PREPARE:
> +		smp_call_function_single(cpu, nmi_cpu_stop, NULL, 1);
> +		break;
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block oprofile_cpu_nb = { 
> +	.notifier_call = oprofile_cpu_notifier
> +};	
> +#endif
> +
>  #ifdef CONFIG_PM
>  
>  static int nmi_suspend(struct sys_device *dev, pm_message_t state)
>  {
> +	/* Only one CPU left, just stop that one */
>  	if (nmi_enabled == 1)
> -		nmi_stop();
> +		nmi_cpu_stop(NULL);
>  	return 0;
>  }
>  
>  static int nmi_resume(struct sys_device *dev)
>  {
>  	if (nmi_enabled == 1)
> -		nmi_start();
> +		nmi_cpu_start(NULL);
>  	return 0;
>  }
>  
> @@ -66,6 +92,7 @@ static int __init init_sysfs(void)
>  	error = sysdev_class_register(&oprofile_sysclass);
>  	if (!error)
>  		error = sysdev_register(&device_oprofile);
> +	
>  	return error;
>  }
>  
> @@ -483,6 +510,9 @@ int __init op_nmi_init(struct oprofile_operations *ops)
>  	}
>  
>  	init_sysfs();
> +#ifdef CONFIG_SMP
> +	register_cpu_notifier(&oprofile_cpu_nb);
> +#endif
>  	using_nmi = 1;
>  	ops->create_files = nmi_create_files;
>  	ops->setup = nmi_setup;
> @@ -496,6 +526,10 @@ int __init op_nmi_init(struct oprofile_operations *ops)
>  
>  void op_nmi_exit(void)
>  {
> -	if (using_nmi)
> +	if (using_nmi) { 
>  		exit_sysfs();
> +#ifdef CONFIG_SMP
> +		unregister_cpu_notifier(&oprofile_cpu_nb);
> +#endif
> +	}
>  }
> -- 
> 1.5.6
> 


-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-09-01 16:34   ` Robert Richter
@ 2008-09-05 17:58     ` Ingo Molnar
  2008-09-05 18:59       ` Robert Richter
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-09-05 17:58 UTC (permalink / raw)
  To: Robert Richter
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Philippe Elie,
	Linux Kernel Mailing List, Andi Kleen


hi Robert,

* Robert Richter <robert.richter@amd.com> wrote:

> Ingo,
> 
> this fix (commit id 1404e403) is not yet upstream for v2.6.27. Could
> this be added to the next pull request?

it's already upstream, as commit 80a8c9fffa. I've zapped the duplicate 
1404e403 from x86/oprofile.

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-09-05 17:58     ` Ingo Molnar
@ 2008-09-05 18:59       ` Robert Richter
  2008-09-05 20:31         ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Robert Richter @ 2008-09-05 18:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Philippe Elie,
	Linux Kernel Mailing List, Andi Kleen

On 05.09.08 19:58:29, Ingo Molnar wrote:
> > this fix (commit id 1404e403) is not yet upstream for v2.6.27. Could
> > this be added to the next pull request?
> 
> it's already upstream, as commit 80a8c9fffa. I've zapped the duplicate 
> 1404e403 from x86/oprofile.

Ah, I stumbled other this because it is still listed in the shortlog:

 $ git shortlog linux-2.6/master..tip/master | grep oprofile
       x86: fix oprofile + hibernation badness
 ...

But anyway, I found the other commit as well, thanks.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: oprofile + hibernation = badness
  2008-09-05 18:59       ` Robert Richter
@ 2008-09-05 20:31         ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2008-09-05 20:31 UTC (permalink / raw)
  To: Robert Richter
  Cc: Vegard Nossum, Pavel Machek, Rafael J. Wysocki, Philippe Elie,
	Linux Kernel Mailing List, Andi Kleen


* Robert Richter <robert.richter@amd.com> wrote:

> On 05.09.08 19:58:29, Ingo Molnar wrote:
> > > this fix (commit id 1404e403) is not yet upstream for v2.6.27. Could
> > > this be added to the next pull request?
> > 
> > it's already upstream, as commit 80a8c9fffa. I've zapped the duplicate 
> > 1404e403 from x86/oprofile.
> 
> Ah, I stumbled other this because it is still listed in the shortlog:
> 
>  $ git shortlog linux-2.6/master..tip/master | grep oprofile
>        x86: fix oprofile + hibernation badness
>  ...

yeah - it's still in the integration branch - but it doesnt hurt as it's 
same-content patch (it got cherry-picked over into an urgent branch). 
The duplication should go away on the next tip/master integration run 
(tomorrow-ish) - please let me know if it doesnt.

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2008-09-05 20:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-08-18 20:32 oprofile + hibernation = badness Vegard Nossum
2008-08-18 20:51 ` Rafael J. Wysocki
2008-08-18 21:08   ` Vegard Nossum
2008-08-18 21:15     ` Rafael J. Wysocki
2008-08-18 21:29       ` Andrew Morton
2008-08-19  1:13 ` Andi Kleen
2008-08-19  7:12   ` Vegard Nossum
2008-08-19  9:49     ` Ingo Molnar
2008-08-19 12:12       ` Andi Kleen
2008-08-19 12:37         ` Robert Richter
2008-08-19 12:56         ` Johannes Weiner
2008-08-19 13:18           ` Andi Kleen
2008-08-19 13:18         ` Ingo Molnar
2008-09-01 16:34   ` Robert Richter
2008-09-05 17:58     ` Ingo Molnar
2008-09-05 18:59       ` Robert Richter
2008-09-05 20:31         ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).