linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* boot panic regression introduced in 3.5-rc7
       [not found] <1107100725.1273654.1343619920692.JavaMail.root@redhat.com>
@ 2012-07-30  3:51 ` CAI Qian
  2012-07-30 17:53   ` John Stultz
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: CAI Qian @ 2012-07-30  3:51 UTC (permalink / raw)
  To: linux-kernel 
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, Zhouping Liu

The bisecting pointed out this patch caused one of dell servers boot panic.

  5baefd6d84163443215f4a99f6a20f054ef11236
  hrtimer: Update hrtimer base offsets each hrtimer_interrupt

[    2.971092] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x10a/0x120()
[    2.971092] Hardware name: PowerEdge M605
[    2.971092] Modules linked in:
[    2.971092] Pid: 1, comm: swapper/0 Not tainted 3.5.0-rc6+ #8
[    2.971092] Call Trace:
[    2.971092]  <IRQ>  [<ffffffff81065a6f>] warn_slowpath_common+0x7f/0xc0
[    2.971092]  [<ffffffff81065aca>] warn_slowpath_null+0x1a/0x20
[    2.971092]  [<ffffffff810cb03a>] clockevents_program_event+0x10a/0x120
[    2.971092]  [<ffffffff810cc554>] tick_program_event+0x24/0x30
[    2.971092]  [<ffffffff8109671d>] hrtimer_interrupt+0x1ad/0x230
[    2.971092]  [<ffffffff816884e9>] smp_apic_timer_interrupt+0x69/0x99
[    2.971092]  [<ffffffff8168716f>] apic_timer_interrupt+0x6f/0x80
[    2.971092]  <EOI>  [<ffffffff8167dc30>] ? retint_restore_args+0x13/0x13
[    2.971092]  [<ffffffff81348563>] ? inflate_fast+0x2c3/0x5f0
[    2.971092]  [<ffffffff81349cec>] zlib_inflate+0x123c/0x1970
[    2.971092]  [<ffffffff811c1f0b>] ? vfs_write+0x10b/0x190
[    2.971092]  [<ffffffff81c86144>] ? bunzip2+0x472/0x472
[    2.971092]  [<ffffffff811c21ed>] ? sys_write+0x4d/0x90
[    2.971092]  [<ffffffff81c5aebb>] ? do_copy+0x87/0x8e
[    2.971092]  [<ffffffff81c5ac5c>] ? write_buffer+0x22/0x32
[    2.971092]  [<ffffffff81c5ac6c>] ? write_buffer+0x32/0x32
[    2.971092]  [<ffffffff81c86144>] ? bunzip2+0x472/0x472
[    2.971092]  [<ffffffff81c863c8>] gunzip+0x27b/0x349
[    2.971092]  [<ffffffff81c5b18e>] unpack_to_rootfs+0x159/0x267
[    2.971092]  [<ffffffff81c5abc0>] ? md_run_setup+0x9c/0x9c
[    2.971092]  [<ffffffff81c5b3f2>] ? maybe_link.part.2+0x111/0x111
[    2.971092]  [<ffffffff81c5b44a>] populate_rootfs+0x58/0x109
[    2.971092]  [<ffffffff8100212a>] do_one_initcall+0x12a/0x180
[    2.971092]  [<ffffffff81c59dd1>] kernel_init+0x15d/0x1e1
[    2.971092]  [<ffffffff81c59614>] ? do_early_param+0x8c/0x8c
[    2.971092]  [<ffffffff81687a74>] kernel_thread_helper+0x4/0x10
[    2.971092]  [<ffffffff8167dc30>] ? retint_restore_args+0x13/0x13
[    2.971092]  [<ffffffff81c59c74>] ? start_kernel+0x401/0x401
[    2.971092]  [<ffffffff81687a70>] ? gs_change+0x13/0x13
[    2.971092] ---[ end trace 319c95c486d7d9cd ]---
[    2.971092] Freeing initrd memory: 23412k freed
[    2.971092] DMA-API: preallocated 32768 debug entries
[    2.971092] DMA-API: debugging enabled by kernel config
[    6.034660] work_for_cpu (34) used greatest stack depth: 6520 bytes left
[    6.042463] work_for_cpu (35) used greatest stack depth: 6248 bytes left
[    6.098917] PCI-DMA: Disabling AGP.
[    6.098917] PCI-DMA: aperture base @ f4000000 size 65536 KB
[    6.098917] init_memory_mapping: [mem 0xf4000000-0xf7ffffff]
[    6.098917] PCI-DMA: using GART IOMMU.
[    6.098917] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
[    6.593402] cryptomgr_test (59) used greatest stack depth: 6200 bytes left
[    6.601692] audit: initializing netlink socket (disabled)
[    6.601692] type=2000 audit(196996417573.616:1): initialized
[    7.176075] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[    7.559238] VFS: Disk quotas dquot_6.5.2
[    7.559238] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    7.663975] msgmni has been set to 7965
[    7.702014] cryptomgr_test (68) used greatest stack depth: 5576 bytes left
[    7.710867] cryptomgr_test (70) used greatest stack depth: 5416 bytes left
[    7.720367] alg: No test for stdrng (krng)
[    7.724763] NET: Registered protocol family 38
[    7.725463] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
[    7.743107] io scheduler noop registered
[    7.743559] io scheduler deadline registered (default)
[    7.743559] io scheduler cfq registered
[    7.743559] start plist test
[    7.743559] end plist test
[    7.896741] work_for_cpu (77) used greatest stack depth: 5288 bytes left
[    7.917588] work_for_cpu (79) used greatest stack depth: 5112 bytes left
[    7.943340] pcieport 0000:00:0a.0: Signaling PME through PCIe PME interrupt
[    7.943340] pcieport 0000:00:0b.0: Signaling PME through PCIe PME interrupt
[    7.943340] pci 0000:02:00.0: Signaling PME through PCIe PME interrupt
[    7.943340] pcieport 0000:00:0c.0: Signaling PME through PCIe PME interrupt
[    7.943340] pci 0000:03:00.0: Signaling PME through PCIe PME interrupt
[    7.943340] pci 0000:04:00.0: Signaling PME through PCIe PME interrupt
[    7.943340] pcieport 0000:00:0d.0: Signaling PME through PCIe PME interrupt
[    7.943340] pci 0000:05:00.0: Signaling PME through PCIe PME interrupt
[    7.943340] pci 0000:06:00.0: Signaling PME through PCIe PME interrupt
[    7.943340] pcieport 0000:00:0f.0: Signaling PME through PCIe PME interrupt
[    7.943340] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    7.943340] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[    7.943340] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    7.943340] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[    7.943340] ACPI: Power Button [PWRF]
[    9.001224] ERST: Failed to get Error Log Address Range.
[   19.807939] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[   19.808927] Pid: 0, comm: swapper/0 Tainted: G        W    3.5.0-rc6+ #8
[   19.808927] Call Trace:
[   19.808927]  <NMI>  [<ffffffff8166dd25>] panic+0xba/0x1ce
[   19.808927]  [<ffffffff81110a30>] ? touch_nmi_watchdog+0x90/0x90
[   19.808927]  [<ffffffff81110ae5>] watchdog_overflow_callback+0xb5/0xc0
[   19.808927]  [<ffffffff8114ed9d>] __perf_event_overflow+0x9d/0x320
[   19.808927]  [<ffffffff8114aecd>] ? perf_event_update_userpage+0x16d/0x2c0
[   19.808927]  [<ffffffff8114ad60>] ? perf_event_task_disable+0x90/0x90
[   19.808927]  [<ffffffff8114fff4>] perf_event_overflow+0x14/0x20
[   19.808927]  [<ffffffff8102b833>] x86_pmu_handle_irq+0xe3/0x130
[   19.808927]  [<ffffffff8167f81d>] perf_event_nmi_handler+0x1d/0x20
[   19.808927]  [<ffffffff8167ed4e>] nmi_handle.isra.2+0xbe/0x210
[   19.808927]  [<ffffffff8167ec90>] ? __die+0xf0/0xf0
[   19.808927]  [<ffffffff8167efa8>] do_nmi+0x108/0x380
[   19.808927]  [<ffffffff8167e2cc>] end_repeat_nmi+0x1a/0x1e
[   19.808927]  [<ffffffff81096743>] ? hrtimer_interrupt+0x1d3/0x230
[   19.808927]  [<ffffffff81096743>] ? hrtimer_interrupt+0x1d3/0x230
[   19.808927]  [<ffffffff81096743>] ? hrtimer_interrupt+0x1d3/0x230
[   19.808927]  <<EOE>>  <IRQ>  [<ffffffff810cceae>] ? tick_nohz_stop_idle+0x3e/0x50
[   19.808927]  [<ffffffff816884e9>] smp_apic_timer_interrupt+0x69/0x99
[   19.808927]  [<ffffffff8168716f>] apic_timer_interrupt+0x6f/0x80
[   19.808927]  <EOI>  [<ffffffff8167dc30>] ? retint_restore_args+0x13/0x13
[   19.808927]  [<ffffffff810cd0cb>] ? tick_nohz_idle_enter+0x4b/0x80
[   19.808927]  [<ffffffff810cd0c7>] ? tick_nohz_idle_enter+0x47/0x80
[   19.808927]  [<ffffffff81023462>] cpu_idle+0x72/0x140
[   19.808927]  [<ffffffff816515e5>] rest_init+0x149/0x154
[   19.808927]  [<ffffffff8165149c>] ? csum_partial_copy_generic+0x16c/0x16c
[   19.808927]  [<ffffffff81c59c67>] start_kernel+0x3f4/0x401
[   19.808927]  [<ffffffff81c59672>] ? repair_env_string+0x5e/0x5e
[   19.808927]  [<ffffffff81c59356>] x86_64_start_reservations+0x131/0x135
[   19.808927]  [<ffffffff81c5945a>] x86_64_start_kernel+0x100/0x10f

Kernel config and full dmesg are here.
http://people.redhat.com/qcai/config
http://people.redhat.com/qcai/dmesg

CAI Qian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: boot panic regression introduced in 3.5-rc7
  2012-07-30  3:51 ` boot panic regression introduced in 3.5-rc7 CAI Qian
@ 2012-07-30 17:53   ` John Stultz
  2012-07-31  1:59     ` CAI Qian
  2012-07-30 20:59   ` John Stultz
  2012-07-31  5:48   ` John Stultz
  2 siblings, 1 reply; 6+ messages in thread
From: John Stultz @ 2012-07-30 17:53 UTC (permalink / raw)
  To: CAI Qian
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, Zhouping Liu

On 07/29/2012 08:51 PM, CAI Qian wrote:
> The bisecting pointed out this patch caused one of dell servers boot panic.
>
>    5baefd6d84163443215f4a99f6a20f054ef11236
>    hrtimer: Update hrtimer base offsets each hrtimer_interrupt
>
> [    2.971092] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x10a/0x120()

Thanks for the report! Just to confirm, you're seeing this with the 
final 3.5?

I'll let you know as soon as I've got something to test.

thanks
-john


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: boot panic regression introduced in 3.5-rc7
  2012-07-30  3:51 ` boot panic regression introduced in 3.5-rc7 CAI Qian
  2012-07-30 17:53   ` John Stultz
@ 2012-07-30 20:59   ` John Stultz
  2012-07-31  5:48   ` John Stultz
  2 siblings, 0 replies; 6+ messages in thread
From: John Stultz @ 2012-07-30 20:59 UTC (permalink / raw)
  To: CAI Qian
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, Zhouping Liu

On 07/29/2012 08:51 PM, CAI Qian wrote:
> The bisecting pointed out this patch caused one of dell servers boot panic.
>
>    5baefd6d84163443215f4a99f6a20f054ef11236
>    hrtimer: Update hrtimer base offsets each hrtimer_interrupt
>
> [    2.971092] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x10a/0x120()
> [    2.971092] Hardware name: PowerEdge M605
> [    2.971092] Modules linked in:
Looking at the dmesg:

[    0.000000] Extended CMOS year: 8200


I'm working with Prarit to try to debug the issue on the affected 
machine. He noticed part of the problem is that the offs_real was set 
to  0x7FFFFFFFFFFFFFFF, which is the same as KTIME_MAX.

I suspect from the dmesg above we're getting bad data from the CMOS 
clock, and that's then causing an overflow converting to a ktime_t 
(64bits of nanoseconds can only hold ~584 years).

I've still not quite narrowed down why this hasn't bit you earlier, 
since the same wall_to_monotonic -> ktime conversion was done in 
retrigger_next_event before the change.  Maybe something called 
settimeofday(), fixing crazy time value before you switched to highres mode?

Once I sort out this last question,  I'll try to see where we can add 
some sanity checking for this sort of thing.

thanks
-john


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: boot panic regression introduced in 3.5-rc7
  2012-07-30 17:53   ` John Stultz
@ 2012-07-31  1:59     ` CAI Qian
  0 siblings, 0 replies; 6+ messages in thread
From: CAI Qian @ 2012-07-31  1:59 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, Zhouping Liu



----- Original Message -----
> On 07/29/2012 08:51 PM, CAI Qian wrote:
> > The bisecting pointed out this patch caused one of dell servers
> > boot panic.
> >
> >    5baefd6d84163443215f4a99f6a20f054ef11236
> >    hrtimer: Update hrtimer base offsets each hrtimer_interrupt
> >
> > [    2.971092] WARNING: at kernel/time/clockevents.c:209
> > clockevents_program_event+0x10a/0x120()
> 
> Thanks for the report! Just to confirm, you're seeing this with the
> final 3.5?
Yes, and the latest linus and tip trees too.
> 
> I'll let you know as soon as I've got something to test.
> 
> thanks
> -john
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: boot panic regression introduced in 3.5-rc7
  2012-07-30  3:51 ` boot panic regression introduced in 3.5-rc7 CAI Qian
  2012-07-30 17:53   ` John Stultz
  2012-07-30 20:59   ` John Stultz
@ 2012-07-31  5:48   ` John Stultz
  2012-08-01  6:50     ` Thomas Gleixner
  2 siblings, 1 reply; 6+ messages in thread
From: John Stultz @ 2012-07-31  5:48 UTC (permalink / raw)
  To: CAI Qian
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, Zhouping Liu

On 07/29/2012 08:51 PM, CAI Qian wrote:
> The bisecting pointed out this patch caused one of dell servers boot panic.
>
>    5baefd6d84163443215f4a99f6a20f054ef11236
>    hrtimer: Update hrtimer base offsets each hrtimer_interrupt
>
> [    2.971092] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x10a/0x120()
> [    2.971092] Hardware name: PowerEdge M605

Ok. So I think I've chased this all the way down.

The main issue as noted earlier, is that on this system, the RTC/CMOS is 
returning a year of 8200 as seen in the dmesg:

[    0.000000] Extended CMOS year: 8200

This causes problems because,  the (signed) 64bit ktime_t structure can 
only store ~292 years of nanoseconds.  Thus, when initialize the time 
from the persistent clock, and set the time to the year 8200, this 
results in the timekeeper.offs_real being capped at KTIME_MAX ((1LL<<63)-1).

So congrats! While most folks haven't started looking at the 2038 issue 
on 32bit systems, you've already started pushing the internal limits on 
64bit systems :)

Now, while this is obviously problematic, this point confused me for a 
bit:  Prior to the commit bisected in the original mail above, we stored 
the same bad KTIME_MAX data in the 
cpu_base->clock_base[HRTIMER_BASE_REALTIME].offset value.  We just 
didn't read the value from the timekeeping core at each interrupt, and 
the value isn't actually changing when the warning and panic is being 
triggered.

So it was unclear as to why if we're providing the same bad KTIME_MAX 
value to hrtimer_interrupt, why are we seeing problems now and not before?

After hacking the kernel and forcing the persistent clock to return a 
similar bad CMOS value of the year 8200, I could reproduce this and 
finally track it down.

Ends up there's a slight difference in  ktime_get_update_offsets() vs 
ktime_get():

ktime_get() does basically the following:
         return timespec_to_ktime(timespec_add(xtime, wall_to_monotonic))

Where as ktime_get_update_offsets does approximately:
         return ktime_sub(timespec_to_ktime(xtime), realtime_offset);

The problem is, at boot we set xtime = year 8200 and wall_to_monotonic = 
year -8200,  ktime_get adds both values, mostly nulling the difference 
out (leaving only how long the system has been up), then converts that 
relatively small value to a ktime_t properly without losing any information.

ktime_get_update_offsets however, since it converts xtime (again set to 
some value greater then year 8200), to a ktime, it gets clamped at 
KTIME_MAX, then we subtract realtime_offset, which is _also_ clamped at 
KTIME_MAX, resulting in us always returning almost[1] zero.  This causes 
us to stop expiring timers.

Now, one of the reasons Thomas and I changed the logic was that using 
the precalculated realtime_offset was slightly more efficient then 
re-adding xtime and wall_to_monotonic's components separately. But how 
valuable this unmeasured slight efficiency is vs extra robustness for 
crazy time values is questionable.

Additionally I suspect that your system probably corrects itself in 
early boot via ntpdate, as I'm pretty sure you'd have other strange 
timer behavior trying to run the system with a date larger then KTIME_MAX.

So I suspect we need two fixes here:
1) Fall back to using the full-precision ktime_get() method of 
calculating the current monotonic time in ktime_get_update_offsets to 
avoid what is in effect precision loss with very large timespecs.
2) Validate that time values we accept are smaller the ktime_t before 
using them.

Thomas, does this sound reasonable? Patches to follow shortly.

thanks
-john


[1] So the reality is slightly more complicated, since 
ktime_get_update_offsets actually returns:
         return ktime_sub(ktime_add(ktime_set(xtime.tv_sec,0),nsecs), 
realtime_offset);
Which basically means we return some value that increases to ~4seconds 
and then nsec overflows and we loop back to zero.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: boot panic regression introduced in 3.5-rc7
  2012-07-31  5:48   ` John Stultz
@ 2012-08-01  6:50     ` Thomas Gleixner
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2012-08-01  6:50 UTC (permalink / raw)
  To: John Stultz
  Cc: CAI Qian, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Prarit Bhargava, Zhouping Liu

On Mon, 30 Jul 2012, John Stultz wrote:
> On 07/29/2012 08:51 PM, CAI Qian wrote:
> Now, one of the reasons Thomas and I changed the logic was that using the
> precalculated realtime_offset was slightly more efficient then re-adding xtime
> and wall_to_monotonic's components separately. But how valuable this
> unmeasured slight efficiency is vs extra robustness for crazy time values is
> questionable.

Well, I guess it unearthed a weakness which has been there forever:

  Trusting random values which are supplied by cmos or whatever.

So the right fix is:
 
> 2) Validate that time values we accept are smaller the ktime_t before using
> them.

I really don't like the magic workaround by nulling out the effect of
crap data with a more expensive calculation.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-08-01  6:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1107100725.1273654.1343619920692.JavaMail.root@redhat.com>
2012-07-30  3:51 ` boot panic regression introduced in 3.5-rc7 CAI Qian
2012-07-30 17:53   ` John Stultz
2012-07-31  1:59     ` CAI Qian
2012-07-30 20:59   ` John Stultz
2012-07-31  5:48   ` John Stultz
2012-08-01  6:50     ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).