* State of current Xen debugger
@ 2010-09-14 14:22 Roger Cruz
  2010-09-14 14:30 ` Tim Deegan
  0 siblings, 1 reply; 13+ messages in thread
From: Roger Cruz @ 2010-09-14 14:22 UTC (permalink / raw)
  To: xen-devel




I am trying to debug a problem where the hypervisor is hanging hard.  Not even the NMI watchdog is triggering a reboot, so I want to hook up a debugger.  What is the state of the current debuggers out there?  Any input on how I should set one up (kdb, gdb, etc.) and pointers to a good wiki page would be much appreciated.  I did a Google search and found some links, but I want to hear from the current developers what is most stable and useful for debugging this type of hard hang.  I only have a serial port PCI-express card to use, as the laptop has no built-in port.

Thank you in advance.

Roger R. Cruz


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: State of current Xen debugger
  2010-09-14 14:22 State of current Xen debugger Roger Cruz
@ 2010-09-14 14:30 ` Tim Deegan
  2010-09-14 14:56   ` Roger Cruz
  0 siblings, 1 reply; 13+ messages in thread
From: Tim Deegan @ 2010-09-14 14:30 UTC (permalink / raw)
  To: Roger Cruz; +Cc: xen-devel

Hi, 

At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
> I am trying to debug a problem where the hypervisor is hanging hard.
> Not even the NMI watchdog is triggering a reboot.  So I wanted to hook
> up a debugger. 

Sorry to bring a counsel of despair but if the NMI watchdog isn't
working then your chances of getting a working debugger are slim.  It's
likely that at least one CPU is very very stuck.  Does the 'd' debug key
work on the serial line when the machine is wedged?

On a more cheerful note, I've twice seen hard hangs like this that
turned out to be hardware issues, fixable with BIOS upgrades. 

Cheers,

Tim.

> What is the state of the current debuggers out there?
> Any input on how I should set it up (kdb, gdb, etc) and pointers to a
> good wiki page are much appreciated.  I did perform a Google search
> and found some links but I want to hear from the current developers as
> to what is most stable and useful for debugging this type of hard
> hang.  I only have a serial port PCI-express card to use as the laptop
> has no built in port.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)


* RE: State of current Xen debugger
  2010-09-14 14:30 ` Tim Deegan
@ 2010-09-14 14:56   ` Roger Cruz
  2010-09-14 15:19     ` Dan Magenheimer
  2010-09-14 15:20     ` Tim Deegan
  0 siblings, 2 replies; 13+ messages in thread
From: Roger Cruz @ 2010-09-14 14:56 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel



Hi Tim,  good to hear from you again

I had a pretty good inkling that one of you hardcore developers would say that :-)  Yes, it is pretty well wedged.  I can cause the problem more rapidly by dropping to a single CPU.  When the hang happens, the Xen console is completely dead.  None of the special keys work.  

I do have hopes a BIOS upgrade could fix this as a last resort but I want to see if at least I can understand the problem.  We have a few different machines that are exhibiting similar symptoms so I have to see if I can find a work-around without requiring every user to upgrade their BIOS :-( 

Just in case, what debugger have you been using?  Are there recent instructions on how to set it up that you can point me to?

Thanks
Roger



* RE: State of current Xen debugger
  2010-09-14 14:56   ` Roger Cruz
@ 2010-09-14 15:19     ` Dan Magenheimer
  2010-09-14 15:54       ` Roger Cruz
  2010-09-14 15:20     ` Tim Deegan
  1 sibling, 1 reply; 13+ messages in thread
From: Dan Magenheimer @ 2010-09-14 15:19 UTC (permalink / raw)
  To: Roger Cruz, Tim Deegan; +Cc: xen-devel



A couple of thoughts:

Have you tried max_cstate=0 (as a Xen boot option)?

Also, you didn't say what version of Xen you are using but playing around with hpet_broadcast (enabling it or force-disabling it as below) might be worth a try.

http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html


* Re: State of current Xen debugger
  2010-09-14 14:56   ` Roger Cruz
  2010-09-14 15:19     ` Dan Magenheimer
@ 2010-09-14 15:20     ` Tim Deegan
  2010-09-14 16:08       ` Roger Cruz
  1 sibling, 1 reply; 13+ messages in thread
From: Tim Deegan @ 2010-09-14 15:20 UTC (permalink / raw)
  To: Roger Cruz; +Cc: xen-devel

At 15:56 +0100 on 14 Sep (1284479787), Roger Cruz wrote:
> I had a pretty good inkling that one of you hardcore developers would
> say that :-) Yes, it is pretty well wedged.  I can cause the problem
> more rapidly by dropping to a single CPU.  When the hang happens, the
> Xen console is completely dead.  None of the special keys work.

If the 'd' key doesn't work then the serial irq isn't getting handled,
so the CPU is wedged at a higher TPR (at least).  Usually in that case
the CPU is spinning so the NMI watchdog timer kicks in OK; possibly if
it was idle with a high TPR it wouldn't.

What version of Xen are you using?  

It might be worth trying a boot with MSI disabled (there were reports at
one stage of MSIs not being EOI'd because the timer interrupt that would
remind Xen to EOI them was at a lower priority than the MSI).

> I do have hopes a BIOS upgrade could fix this as a last resort but I
> want to see if at least I can understand the problem.  We have a few
> different machines that are exhibiting similar symptoms so I have to
> see if I can find a work-around without requiring every user to
> upgrade their BIOS :-(
> 
> Just in case, what debugger have you been using?  Are there recent
> instructions on how to set it up that you can point me to?

I don't use a debugger on Xen.  I usually find that by the time the
debugger kicks in it's too late to help, so I end up finding bugs by
code inspection and printks. :)

Mukesh Rathor at Oracle has done some debugger work, though, including
an in-Xen debugger.  There's a gdb stub too but I suspect it's rotted
quite badly.

Cheers,

Tim.

> Thanks
> Roger
> 
> 
-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)


* RE: State of current Xen debugger
  2010-09-14 15:19     ` Dan Magenheimer
@ 2010-09-14 15:54       ` Roger Cruz
  2010-09-15 14:19         ` Roger Cruz
  2010-09-28 15:21         ` Roger Cruz
  0 siblings, 2 replies; 13+ messages in thread
From: Roger Cruz @ 2010-09-14 15:54 UTC (permalink / raw)
  To: Dan Magenheimer, Tim Deegan; +Cc: xen-devel



Hi Dan,

I am using Xen 3.4.2 with very minor modifications (some backports, for example).

I have not tried your suggestions yet, so I will do that next.  Thanks!

R.


* RE: State of current Xen debugger
  2010-09-14 15:20     ` Tim Deegan
@ 2010-09-14 16:08       ` Roger Cruz
  0 siblings, 0 replies; 13+ messages in thread
From: Roger Cruz @ 2010-09-14 16:08 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel



I am using 3.4.2 with some modifications.

I added printks to nmi_watchdog_tick as shown below.  I don't break the console lock, but I am convinced that the printk lock is not the problem: I also tested with an empty (no-op) printk routine and it still hangs, so it felt pretty safe not to break the lock.  I also tried console_start_sync/console_end_sync to make sure I was seeing all the messages when it hung.


void nmi_watchdog_tick(struct cpu_user_regs * regs)
{
    unsigned int sum = this_cpu(nmi_timer_ticks);

    if ( (this_cpu(last_irq_sums) == sum) &&
         !atomic_read(&watchdog_disable_count) )
    {
        /* Debug: the timer tick count has not moved since the last NMI. */
        if ( sum > 20 )
        {
            // console_start_sync();
            printk("**** CPU%d, counter=%d, last_sum=%d, curr_sum=%d, hz=%d, nmis=%d\n",
                   smp_processor_id(), this_cpu(alert_counter),
                   this_cpu(last_irq_sums), sum, 5*nmi_hz,
                   nmi_count(smp_processor_id()));
            // console_end_sync();
        }

        /*
         * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
         * before doing the oops ...
         */
        this_cpu(alert_counter)++;
        if ( this_cpu(alert_counter) == 5*nmi_hz )
        {
            console_force_unlock();
            printk("Watchdog timer detects that CPU%d is stuck!\n",
                   smp_processor_id());
            fatal_trap(TRAP_nmi, regs);
        }
    }
    else
    {
        /* Debug: the timer is still ticking; log progress, then reset. */
        if ( sum > 20 )
        {
            // console_start_sync();
            printk("*CPU%d, counter=%d, last_sum=%d, curr_sum=%d, nmis=%d\n",
                   smp_processor_id(), this_cpu(alert_counter),
                   this_cpu(last_irq_sums), sum,
                   nmi_count(smp_processor_id()));
            // console_end_sync();
        }

        this_cpu(last_irq_sums) = sum;
        this_cpu(alert_counter) = 0;
    }
}


My messages stop printing and I get a hard hang.  The performance-counter NMI appears to arrive about once every 4 seconds, but I have observed instances where NMIs are about 10 seconds apart.  I am not sure what makes the NMIs arrive at uneven intervals.  As a test, I turned on SpeedStep and the power management functions in the BIOS and it still hangs.

(XEN) *CPU0, counter=0, last_sum=974, curr_sum=977, nmis=391
(XEN) *CPU0, counter=0, last_sum=977, curr_sum=979, nmis=392
(XEN) *CPU0, counter=0, last_sum=979, curr_sum=981, nmis=393
(XEN) *CPU0, counter=0, last_sum=981, curr_sum=984, nmis=394
(XEN) *CPU0, counter=0, last_sum=984, curr_sum=986, nmis=395
(XEN) *CPU0, counter=0, last_sum=986, curr_sum=988, nmis=396
(XEN) *CPU0, counter=0, last_sum=988, curr_sum=991, nmis=397
(XEN) *CPU0, counter=0, last_sum=991, curr_sum=993, nmis=398
(XEN) *CPU0, counter=0, last_sum=993, curr_sum=995, nmis=399
(XEN) *CPU0, counter=0, last_sum=995, curr_sum=997, nmis=400
(XEN) *CPU0, counter=0, last_sum=997, curr_sum=1000, nmis=401
(XEN) *CPU0, counter=0, last_sum=1000, curr_sum=1002, nmis=402
(XEN) *CPU0, counter=0, last_sum=1002, curr_sum=1005, nmis=403
(XEN) *CPU0, counter=0, last_sum=1005, curr_sum=1008, nmis=404
(XEN) *CPU0, counter=0, last_sum=1008, curr_sum=1010, nmis=405
(XEN) *CPU0, counter=0, last_sum=1010, curr_sum=1013, nmis=406
(XEN) *CPU0, counter=0, last_sum=1013, curr_sum=1015, nmis=407
(XEN) *CPU0, counter=0, last_sum=1015, curr_sum=1018, nmis=408
(XEN) *CPU0, counter=0, last_sum=1018, curr_sum=1020, nmis=409
(XEN) *CPU0, counter=0, last_sum=1020, curr_sum=1023, nmis=410
(XEN) *CPU0, counter=0, last_sum=1023, curr_sum=1026, nmis=411
(XEN) *CPU0, counter=0, last_sum=1026, curr_sum=1029, nmis=412
(XEN) *CPU0, counter=0, last_sum=1029, curr_sum=1031, nmis=413
(XEN) *CPU0, counter=0, last_sum=1031, curr_sum=1033, nmis=414
(XEN) *CPU0, counter=0, last_sum=1033, curr_sum=1035, nmis=415
(XEN) *CPU0, counter=0, last_sum=1035, curr_sum=1038, nmis=416
(XEN) *CPU0, counter=0, last_sum=1038, curr_sum=1041, nmis=417
(XEN) *CPU0, counter=0, last_sum=1041, curr_sum=1043, nmis=418
(XEN) *CPU0, counter=0, last_sum=1043, curr_sum=1046, nmis=419
(XEN) *CPU0, counter=0, last_sum=1046, curr_sum=1049, nmis=420
(XEN) *CPU0, counter=0, last_sum=1049, curr_sum=1051, nmis=421
(XEN) *CPU0, counter=0, last_sum=1051, curr_sum=1055, nmis=422
(XEN) *CPU0, counter=0, last_sum=1055, curr_sum=1058, nmis=423
(XEN) *CPU0, counter=0, last_sum=1058, curr_sum=1061, nmis=424
(XEN) *CPU0, counter=0, last_sum=1061, curr_sum=1064, nmis=425
(XEN) *CPU0, counter=0, last_sum=1064, curr_sum=1067, nmis=426
(XEN) *CPU0, counter=0, last_sum=1067, curr_sum=1070, nmis=427
(XEN) *CPU0, counter=0, last_sum=1070, curr_sum=1073, nmis=428
(XEN) *CPU0, counter=0, last_sum=1073, curr_sum=1076, nmis=429
 __  __            _____ _  _    ____ 
 \ \/ /___ _ __   |___ /| || |  |___ \
  \  // _ \ '_ \    |_ \| || |_   __) |
  /  \  __/ | | |  ___) |__   _| / __/
 /_/\_\___|_| |_| |____(_) |_|(_)_____|
                                      
(XEN) Xen version 3.4.2 (rcruz@) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) Mon Sep 13 23:06:17 UTC 2010
(XEN) Latest ChangeSet: Mon Sep 13 16:12:14 2010 -0400 132:a499dd8fcb55



* RE: State of current Xen debugger
  2010-09-14 15:54       ` Roger Cruz
@ 2010-09-15 14:19         ` Roger Cruz
  2010-09-28 15:21         ` Roger Cruz
  1 sibling, 0 replies; 13+ messages in thread
From: Roger Cruz @ 2010-09-15 14:19 UTC (permalink / raw)
  To: Roger Cruz, Dan Magenheimer, Tim Deegan; +Cc: xen-devel



Hi Dan,

I tried both of your suggestions and I am still seeing a hang.  The only
clue I have to go on now is that the print statements I placed in the NMI
tick handler do not keep arriving at a regular 1-second interval.  I
expected them to come at a regular rate, and after boot they do, up to
about the 39th second.  See the output below, where curr_sum comes from
the nmi_timer_ticks variable.  After the 39th tick, the messages are
printed at irregular intervals, anywhere between 2 and 20 seconds apart.
Does Xen adjust the NMI interval rate?  I know that the NMIs are
programmed based on CPU cycles, so I disabled SpeedStep to make sure that
the processor speed is not adjusted, and the symptoms are still identical.

Any other ideas?

Regards,

Roger R. Cruz

void nmi_watchdog_tick(struct cpu_user_regs * regs)
{
    unsigned int sum = this_cpu(nmi_timer_ticks);




 

HEDLEY-T500 login: (XEN) [23:29:04] 17213**** CPU0, counter=0, last_sum=25, curr_sum=27, hz=10, nmis=36
(XEN) [23:29:05] 17298**** CPU0, counter=0, last_sum=27, curr_sum=28, hz=10, nmis=37
(XEN) [23:29:06] 17383**** CPU0, counter=0, last_sum=28, curr_sum=29, hz=10, nmis=38
(XEN) [23:29:06] 17468mm.c:806:d0 Error getting mfn 100 (pfn 3ff0) from L1 entry 8000000000100625 for l1e_owner=0, pg_owner=32753
(XEN) [23:29:07] 17598**** CPU0, counter=0, last_sum=29, curr_sum=29, hz=10, nmis=39
(XEN) [23:29:08] 17683**** CPU0, counter=1, last_sum=29, curr_sum=30, hz=10, nmis=40
(XEN) [23:29:08] 17768**** CPU0, counter=0, last_sum=30, curr_sum=31, hz=10, nmis=41
(XEN) [23:29:09] 17853**** CPU0, counter=0, last_sum=31, curr_sum=32, hz=10, nmis=42
mapping kernel into physical memory
about to get started...
(XEN) [23:29:10] 17938**** CPU0, counter=0, last_sum=32, curr_sum=32, hz=10, nmis=43
(XEN) [23:29:10] 18023**** CPU0, counter=1, last_sum=32, curr_sum=33, hz=10, nmis=44
(XEN) [23:29:11] 18108**** CPU0, counter=0, last_sum=33, curr_sum=34, hz=10, nmis=45
(XEN) [23:29:12] 18193**** CPU0, counter=0, last_sum=34, curr_sum=34, hz=10, nmis=46
(XEN) [23:29:13] 18278**** CPU0, counter=1, last_sum=34, curr_sum=35, hz=10, nmis=47
(XEN) [23:29:13] 18363**** CPU0, counter=0, last_sum=35, curr_sum=36, hz=10, nmis=48
(XEN) [23:29:15] 18448**** CPU0, counter=0, last_sum=36, curr_sum=38, hz=10, nmis=49
(XEN) [23:29:16] 18533**** CPU0, counter=0, last_sum=38, curr_sum=39, hz=10, nmis=50
(XEN) [23:29:18] 18618**** CPU0, counter=0, last_sum=39, curr_sum=41, hz=10, nmis=51
(XEN) [23:29:21] 18703**** CPU0, counter=0, last_sum=41, curr_sum=43, hz=10, nmis=52
(XEN) [23:29:27] 18788**** CPU0, counter=0, last_sum=43, curr_sum=49, hz=10, nmis=53
(XEN) [23:29:34] 18873**** CPU0, counter=0, last_sum=49, curr_sum=56, hz=10, nmis=54
(XEN) [23:29:39] 18958**** CPU0, counter=0, last_sum=56, curr_sum=61, hz=10, nmis=55
(XEN) [23:29:43] 19043**** CPU0, counter=0, last_sum=61, curr_sum=66, hz=10, nmis=56
(XEN) [23:29:47] 19128**** CPU0, counter=0, last_sum=66, curr_sum=69, hz=10, nmis=57
(XEN) [23:29:50] 19213**** CPU0, counter=0, last_sum=69, curr_sum=73, hz=10, nmis=58
(XEN) [23:29:54] 19298**** CPU0, counter=0, last_sum=73, curr_sum=77, hz=10, nmis=59
(XEN) [23:29:58] 19383**** CPU0, counter=0, last_sum=77, curr_sum=80, hz=10, nmis=60
(XEN) [23:30:01] 19468**** CPU0, counter=0, last_sum=80, curr_sum=83, hz=10, nmis=61
(XEN) [23:30:04] 19553**** CPU0, counter=0, last_sum=83, curr_sum=87, hz=10, nmis=62
(XEN) [23:30:08] 19638**** CPU0, counter=0, last_sum=87, curr_sum=91, hz=10, nmis=63
(XEN) [23:30:11] 19723**** CPU0, counter=0, last_sum=91, curr_sum=94, hz=10, nmis=64
(XEN) [23:30:13] 19808**** CPU0, counter=0, last_sum=94, curr_sum=96, hz=10, nmis=65
(XEN) [23:30:16] 19893**** CPU0, counter=0, last_sum=96, curr_sum=99, hz=10, nmis=66
(XEN) [23:30:19] 19978**** CPU0, counter=0, last_sum=99, curr_sum=102, hz=10, nmis=67
(XEN) [23:30:22] 20064**** CPU0, counter=0, last_sum=102, curr_sum=105, hz=10, nmis=68
(XEN) [23:30:25] 20151**** CPU0, counter=0, last_sum=105, curr_sum=108, hz=10, nmis=69
(XEN) [23:30:28] 20238**** CPU0, counter=0, last_sum=108, curr_sum=111, hz=10, nmis=70

 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: State of current Xen debugger
  2010-09-14 15:54       ` Roger Cruz
  2010-09-15 14:19         ` Roger Cruz
@ 2010-09-28 15:21         ` Roger Cruz
  2010-09-28 15:30           ` Keir Fraser
  1 sibling, 1 reply; 13+ messages in thread
From: Roger Cruz @ 2010-09-28 15:21 UTC (permalink / raw)
  To: Roger Cruz, Dan Magenheimer, Tim Deegan; +Cc: xen-devel


I am still chasing this hard hang in our system with a modified Xen 3.4.2.
I have upgraded the BIOS and the problem still exists.  The only thing
that had appeared to work so far was adding max_cstate=0, but I now have a
report of a hang at one customer who had that option set.  The rest of
them have been running successfully for more than a week with this
"work-around".  I have isolated the problem to Lenovo machines with
Centrino processors, which stop the TSC when in C3.

What I really need to understand is why the NMI watchdog in Xen is not
working and causing a crash when the CPU hangs.  I was under the
impression that NMIs couldn't be masked at all.  Is there any way that
Xen could be disabling or changing that behavior?  I know the NMI is
being driven by a timer set in the NMI handler.  Could there be a case
where this timer is disabled?  Any ideas are welcome!

Thanks

Roger R. Cruz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: State of current Xen debugger
  2010-09-28 15:21         ` Roger Cruz
@ 2010-09-28 15:30           ` Keir Fraser
  2010-09-28 15:40             ` Roger Cruz
  0 siblings, 1 reply; 13+ messages in thread
From: Keir Fraser @ 2010-09-28 15:30 UTC (permalink / raw)
  To: Roger Cruz, Dan Magenheimer, Tim Deegan; +Cc: xen-devel

On 28/09/2010 16:21, "Roger Cruz" <roger.cruz@virtualcomputer.com> wrote:

> I am still chasing this hard hang in our system with a modified 3.4.2 xen.  I
> have upgraded the BIOS and the problem still exists.  The only thing that so
> far had appeared to work was adding max_cstate=0 but now I have a report where
> it still hung in one customer who had that flag enabled.  The rest of them
> have been successfully running for more than a week with this "work-around".
> I have isolated the problem to Lenovo with the Centrino processors.  These
> guys will stop the TSC when in C3.
>  
> What I need to really understand is why the NMI/watchdog in Xen is not working
> and causing a crash when the CPU hangs.  I was under the impression that NMIs
> couldn't be masked at all.  Is there anyway that Xen could be disabling or
> changing that behavior?   I know the NMI is being driven by a timer set in the
> NMI handler.  Could there be a case where this timer is disabled?   Any ideas
> are welcome!

The NMI counter gets driven by the APIC timer. Perhaps it needs poking
somehow on wakeup from C3? My suggestion for debugging this would be to take
a look at what native Linux does. The NMI perfctr poking logic was all taken
from (rather old now) upstream Linux.

 -- Keir

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: State of current Xen debugger
  2010-09-28 15:30           ` Keir Fraser
@ 2010-09-28 15:40             ` Roger Cruz
  2010-09-28 17:06               ` Keir Fraser
  0 siblings, 1 reply; 13+ messages in thread
From: Roger Cruz @ 2010-09-28 15:40 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, Tim Deegan; +Cc: xen-devel


By the APIC timer?  When I traced this code I was under the impression
that it is driven by the performance counters counting cycles and
generating an interrupt when the counter overflows.  I found this was the
routine being called to set up the watchdog:

static void __pminit setup_p6_watchdog(unsigned counter)
{
    unsigned int evntsel;

    nmi_perfctr_msr = MSR_P6_PERFCTR0;  /* <--- register */

    clear_msr_range(MSR_P6_EVNTSEL0, 2);
    clear_msr_range(MSR_P6_PERFCTR0, 2);

    evntsel = P6_EVNTSEL_INT
        | P6_EVNTSEL_OS
        | P6_EVNTSEL_USR
        | counter;

    wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
    write_watchdog_counter("P6_PERFCTR0");
    apic_write(APIC_LVTPC, APIC_DM_NMI);
    evntsel |= P6_EVNTSEL0_ENABLE;
    wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
}

and then, during the NMI tick handler, this path was executed:

        else if ( nmi_perfctr_msr == MSR_P6_PERFCTR0 )
        {
            /*
             * Only P6 based Pentium M need to re-unmask the apic vector but
             * it doesn't hurt other P6 variants.
             */
            apic_write(APIC_LVTPC, APIC_DM_NMI);
        }
        write_watchdog_counter(NULL);



static inline void write_watchdog_counter(const char *descr)
{
    u64 count = (u64)cpu_khz * 1000;

    do_div(count, nmi_hz);
    if(descr)
        Dprintk("setting %s to -0x%08Lx\n", descr, count);
    wrmsrl(nmi_perfctr_msr, 0 - count);
}


It is also my understanding that during the CPU C3 state change in
cpu_idle.c, the APIC timer is turned off.  See the comments below.

        /*
         * Before invoking C3, be aware that TSC/APIC timer may be 
         * stopped by H/W. Without carefully handling of TSC/APIC stop issues,
         * deep C state can't work correctly.
         */
        /* preparing APIC stop */
        lapic_timer_off();  /* <--- APIC timer appears to be turned off here */

        /* Get start time (ticks) */
        t1 = inl(pmtmr_ioport);
        /* Trace cpu idle entry */
        TRACE_2D(TRC_PM_IDLE_ENTRY, cx->idx, t1);
        /* Invoke C3 */
        acpi_idle_do_entry(cx);
        /* Get end time (ticks) */
        t2 = inl(pmtmr_ioport);

        /* recovering TSC */
        cstate_restore_tsc();  /* <--- our backport of an unstable patch to keep TSCs synchronized */
        /* Trace cpu idle exit */
 

Thanks Keir!

Roger

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: State of current Xen debugger
  2010-09-28 15:40             ` Roger Cruz
@ 2010-09-28 17:06               ` Keir Fraser
  2010-09-28 17:14                 ` Roger Cruz
  0 siblings, 1 reply; 13+ messages in thread
From: Keir Fraser @ 2010-09-28 17:06 UTC (permalink / raw)
  To: Roger Cruz, Dan Magenheimer, Tim Deegan; +Cc: xen-devel

Yeah, but the performance counters are driven by the same LAPIC timesource
that drives the main LAPIC timer.

 -- Keir

On 28/09/2010 16:40, "Roger Cruz" <roger.cruz@virtualcomputer.com> wrote:

> 
> 
> By the APIC timer?  When I traced this code I was under the impression that is
> driven by the performance counters counting cycles and generating an interrupt
> when the counter overflows.  I found this was the routine being called to
> setup the watchdog
> 
> static void __pminit setup_p6_watchdog(unsigned counter)
> {
>     unsigned int evntsel;
> 
>     nmi_perfctr_msr = MSR_P6_PERFCTR0;  <--- register
> 
>     clear_msr_range(MSR_P6_EVNTSEL0, 2);
>     clear_msr_range(MSR_P6_PERFCTR0, 2);
> 
>     evntsel = P6_EVNTSEL_INT
>         | P6_EVNTSEL_OS
>         | P6_EVNTSEL_USR
>         | counter;
> 
>     wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
>     write_watchdog_counter("P6_PERFCTR0");
>     apic_write(APIC_LVTPC, APIC_DM_NMI);
>     evntsel |= P6_EVNTSEL0_ENABLE;
>     wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
> }
> 
> and then during the NMI tick handler this path was executed
> 
>         else if ( nmi_perfctr_msr == MSR_P6_PERFCTR0 )
>         {
>             /*
>              * Only P6 based Pentium M need to re-unmask the apic vector but
>              * it doesn't hurt other P6 variants.
>              */
>             apic_write(APIC_LVTPC, APIC_DM_NMI);
>         }
>         write_watchdog_counter(NULL);
> 
> 
> 
> static inline void write_watchdog_counter(const char *descr)
> {
>     u64 count = (u64)cpu_khz * 1000;
> 
>     do_div(count, nmi_hz);
>     if(descr)
>         Dprintk("setting %s to -0x%08Lx\n", descr, count);
>     wrmsrl(nmi_perfctr_msr, 0 - count);
> }
> 
> 
> It is also my understanding that during the CPU c3 state change in cpu_idle.c,
> the APIC timer is turned off.  See comments below.
> 
>         /*
>          * Before invoking C3, be aware that TSC/APIC timer may be
>          * stopped by H/W. Without carefully handling of TSC/APIC stop issues,
>          * deep C state can't work correctly.
>          */
>         /* preparing APIC stop */
>         lapic_timer_off();  <------------- APIC timer appears to be turned off
> here.
> 
>         /* Get start time (ticks) */
>         t1 = inl(pmtmr_ioport);
>         /* Trace cpu idle entry */
>         TRACE_2D(TRC_PM_IDLE_ENTRY, cx->idx, t1);
>         /* Invoke C3 */
>         acpi_idle_do_entry(cx);
>         /* Get end time (ticks) */
>         t2 = inl(pmtmr_ioport);
> 
>         /* recovering TSC */
>         cstate_restore_tsc();  <----- this is our backport of an unstable
> patch to keep TSCs synchronized
>         /* Trace cpu idle exit */
> 
> 
> Thanks Keir!
> 
> Roger
> 
> -----Original Message-----
> From: Keir Fraser on behalf of Keir Fraser
> Sent: Tue 9/28/2010 11:30 AM
> To: Roger Cruz; Dan Magenheimer; Tim Deegan
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] State of current Xen debugger
> 
> On 28/09/2010 16:21, "Roger Cruz" <roger.cruz@virtualcomputer.com> wrote:
> 
>> I am still chasing this hard hang in our system with a modified 3.4.2 xen.  I
>> have upgraded the BIOS and the problem still exists.  The only thing that so
>> far had appeared to work was adding max_cstate=0 but now I have a report
>> where
>> it still hung in one customer who had that flag enabled.  The rest of them
>> have been successfully running for more than a week with this ³work-around².
>> I have isolated the problem to Lenovo with the Centrino processors.  These
>> guys will stop the TSC when in C3.
>> 
>> What I need to really understand is why the NMI/watchdog in Xen is not
>> working
>> and causing a crash when the CPU hangs.  I was under the impression that NMIs
>> couldn¹t be masked at all.  Is there anyway that Xen could be disabling or
>> changing that behavior?   I know the NMI is being driven by a timer set in
>> the
>> NMI handler.  Could there be a case where this timer is disabled?   Any ideas
>> are welcome!
> 
> The NMI counter gets driven by the APIC timer. Perhaps it needs poking
> womehow on wakeup from C3? My suggestion for debugging this would be to take
> a look at what native Linux does. The NMI perfctr poking logic was all taken
> from (rather old now) upstream Linux.
> 
>  -- Keir
> 
>> Thanks
>> Roger R. Cruz
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Roger Cruz
>> Sent: Tuesday, September 14, 2010 11:55 AM
>> To: Dan Magenheimer; Tim Deegan
>> Cc: xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> Hi Dan,
>> 
>> I am using 3.4.2 where we have made very minor modifications (some backports,
>> for example).
>> 
>> I have not tried your suggestions.. so I will do that next.. thanks!
>> 
>> R.
>> 
>> -----Original Message-----
>> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>> Sent: Tue 9/14/2010 11:19 AM
>> To: Roger Cruz; Tim Deegan
>> Cc: xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> A couple of thoughts:
>> 
>> 
>> 
>> Have you tried max_cstate=0 (as a Xen boot option)?
>> 
>> 
>> 
>> Also, you didn't say what version of Xen you are using but playing around
>> with
>> hpet_broadcast (enabling it or force-disabling it as below) might be worth a
>> try.
>> 
>> 
>> 
>> http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html
>> 
>> 
>> 
>> From: Roger Cruz [mailto:roger.cruz@virtualcomputer.com]
>> Sent: Tuesday, September 14, 2010 8:56 AM
>> To: Tim Deegan
>> Cc: xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> 
>> 
>> Hi Tim,  good to hear from you again
>> 
>> I had a pretty good inkling that one of you hardcore developers would say
>> that
>> :-)  Yes, it is pretty well wedged.  I can cause the problem more rapidly by
>> dropping to a single CPU.  When the hang happens, the Xen console is
>> completely dead.  None of the special keys work.
>> 
>> I do have hopes a BIOS upgrade could fix this as a last resort but I want to
>> see if at least I can understand the problem.  We have a few different
>> machines that are exhibiting similar symptoms so I have to see if I can find
>> a
>> work-around without requiring every user to upgrade their BIOS :-(
>> 
>> Just in case, what debugger have you been using?  Are there recent
>> instructions on how to set it up that you can point me to?
>> 
>> Thanks
>> Roger
>> 
>> 
>> -----Original Message-----
>> From: Tim Deegan [mailto:Tim.Deegan@citrix.com]
>> Sent: Tue 9/14/2010 10:30 AM
>> To: Roger Cruz
>> Cc: xen-devel@lists.xensource.com
>> Subject: Re: [Xen-devel] State of current Xen debugger
>> 
>> Hi,
>> 
>> At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
>>> I am trying to debug a problem where the hypervisor is hanging hard.
>>> Not even the NMI watchdog is triggering a reboot.  So I wanted to hook
>>> up a debugger.
>> 
>> Sorry to bring a counsel of despair but if the NMI watchdog isn't
>> working then your chances of getting a working debugger are slim.  It's
>> likely that at least one CPU is very very stuck.  Does the 'd' debug key
>> work on the serial line when the machine is wedged?
>> 
>> On a more cheerful note, I've twice seen hard hangs like this that
>> turned out to be hardware issues, fixable with BIOS upgrades.
>> 
>> Cheers,
>> 
>> Tim.
>> 
>>> What is the state of the current debuggers out there?
>>> Any input on how I should set it up (kdb, gdb, etc) and pointers to a
>>> good wiki page are much appreciated.  I did perform a Google search
>>> and found some links but I want to hear from the current developers as
>>> to what is most stable and useful for debugging this type of hard
>>> hang.  I only have a serial port PCI-express card to use as the laptop
>>> has no built in port.
>> 
>> --
>> Tim Deegan <Tim.Deegan@citrix.com>
>> Principal Software Engineer, XenServer Engineering
>> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
>> 


* RE: State of current Xen debugger
  2010-09-28 17:06               ` Keir Fraser
@ 2010-09-28 17:14                 ` Roger Cruz
  0 siblings, 0 replies; 13+ messages in thread
From: Roger Cruz @ 2010-09-28 17:14 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, Tim Deegan; +Cc: xen-devel

Ok, so I'm obviously missing something here.  I thought the performance counters were set up to measure CPU cycles, so when the CPU enters C3 and it asserts the STPCLK signal, the performance counters won't increment and hence we won't get an NMI while the core is in C3.  Only an external interrupt or other wake-up event will wake it up.  I understood that interrupt to be the HPET timer, which is always running and not affected by the TSC being shut down.  Could you please correct me if this is not right?

What is the time source for the LAPIC timer?
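
To make the question concrete, here is a minimal sketch of the model described above. It is not Xen code: watchdog_reload() and counter_advance() are hypothetical helpers, the counter is treated as a plain 64-bit value (real perf counters are narrower), and the assumption that the counter simply stops advancing while the core clock is halted is exactly the point being asked about, not an established fact.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the arithmetic in write_watchdog_counter():
 * count = cpu_khz * 1000 / nmi_hz, and the value written is 0 - count, so
 * the counter overflows (and would raise the NMI) after `count` events.
 */
static uint64_t watchdog_reload(uint64_t cpu_khz, unsigned int nmi_hz)
{
    assert(nmi_hz != 0);
    uint64_t count = cpu_khz * 1000 / nmi_hz;
    return 0 - count;
}

/*
 * Toy model of the counter. If the clock feeding it is stopped (the
 * assumed C3 behaviour), nothing is counted and no overflow can occur.
 * Returns 1 when the counter wraps past zero, 0 otherwise.
 */
static int counter_advance(uint64_t *ctr, uint64_t cycles, int clock_stopped)
{
    if (clock_stopped)
        return 0;
    uint64_t before = *ctr;
    *ctr += cycles;
    return *ctr < before;
}
```

With cpu_khz = 2000000 and nmi_hz = 1, the reload value is 0 - 2000000000, so in this model the overflow arrives after two billion counted events, and any interval spent with the clock stopped contributes nothing towards it.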


-----Original Message-----
From: Keir Fraser [mailto:keir.xen@gmail.com] On Behalf Of Keir Fraser
Sent: Tuesday, September 28, 2010 1:07 PM
To: Roger Cruz; Dan Magenheimer; Tim Deegan
Cc: xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] State of current Xen debugger

Yeah, but the performance counters are driven by the same LAPIC timesource
that drives the main LAPIC timer.

 -- Keir

On 28/09/2010 16:40, "Roger Cruz" <roger.cruz@virtualcomputer.com> wrote:

> 
> 
> By the APIC timer?  When I traced this code I was under the impression that it
> is driven by the performance counters counting cycles and generating an
> interrupt when the counter overflows.  I found this was the routine being
> called to set up the watchdog:
> 
> static void __pminit setup_p6_watchdog(unsigned counter)
> {
>     unsigned int evntsel;
> 
>     nmi_perfctr_msr = MSR_P6_PERFCTR0;  <--- register
> 
>     clear_msr_range(MSR_P6_EVNTSEL0, 2);
>     clear_msr_range(MSR_P6_PERFCTR0, 2);
> 
>     evntsel = P6_EVNTSEL_INT
>         | P6_EVNTSEL_OS
>         | P6_EVNTSEL_USR
>         | counter;
> 
>     wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
>     write_watchdog_counter("P6_PERFCTR0");
>     apic_write(APIC_LVTPC, APIC_DM_NMI);
>     evntsel |= P6_EVNTSEL0_ENABLE;
>     wrmsr(MSR_P6_EVNTSEL0, evntsel, 0);
> }
> 
> and then during the NMI tick handler this path was executed
> 
>         else if ( nmi_perfctr_msr == MSR_P6_PERFCTR0 )
>         {
>             /*
>              * Only P6 based Pentium M need to re-unmask the apic vector but
>              * it doesn't hurt other P6 variants.
>              */
>             apic_write(APIC_LVTPC, APIC_DM_NMI);
>         }
>         write_watchdog_counter(NULL);
> 
> 
> 
> static inline void write_watchdog_counter(const char *descr)
> {
>     u64 count = (u64)cpu_khz * 1000;
> 
>     do_div(count, nmi_hz);
>     if(descr)
>         Dprintk("setting %s to -0x%08Lx\n", descr, count);
>     wrmsrl(nmi_perfctr_msr, 0 - count);
> }
> 
> 
> It is also my understanding that during the CPU c3 state change in cpu_idle.c,
> the APIC timer is turned off.  See comments below.
> 
>         /*
>          * Before invoking C3, be aware that TSC/APIC timer may be
>          * stopped by H/W. Without carefully handling of TSC/APIC stop issues,
>          * deep C state can't work correctly.
>          */
>         /* preparing APIC stop */
>         lapic_timer_off();  <------------- APIC timer appears to be turned off here.
> 
>         /* Get start time (ticks) */
>         t1 = inl(pmtmr_ioport);
>         /* Trace cpu idle entry */
>         TRACE_2D(TRC_PM_IDLE_ENTRY, cx->idx, t1);
>         /* Invoke C3 */
>         acpi_idle_do_entry(cx);
>         /* Get end time (ticks) */
>         t2 = inl(pmtmr_ioport);
> 
>         /* recovering TSC */
>         cstate_restore_tsc();  <----- this is our backport of an unstable patch to keep TSCs synchronized
>         /* Trace cpu idle exit */
> 
> 
> Thanks Keir!
> 
> Roger
> 
> -----Original Message-----
> From: Keir Fraser on behalf of Keir Fraser
> Sent: Tue 9/28/2010 11:30 AM
> To: Roger Cruz; Dan Magenheimer; Tim Deegan
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] State of current Xen debugger
> 
> On 28/09/2010 16:21, "Roger Cruz" <roger.cruz@virtualcomputer.com> wrote:
> 
> >> I am still chasing this hard hang in our system with a modified 3.4.2 xen.  I
> >> have upgraded the BIOS and the problem still exists.  The only thing that so
> >> far had appeared to work was adding max_cstate=0 but now I have a report
> >> where it still hung for one customer who had that flag enabled.  The rest of
> >> them have been successfully running for more than a week with this
> >> "work-around".  I have isolated the problem to Lenovo with the Centrino
> >> processors.  These guys will stop the TSC when in C3.
> >> 
> >> What I need to really understand is why the NMI/watchdog in Xen is not
> >> working and causing a crash when the CPU hangs.  I was under the impression
> >> that NMIs couldn't be masked at all.  Is there any way that Xen could be
> >> disabling or changing that behavior?   I know the NMI is being driven by a
> >> timer set in the NMI handler.  Could there be a case where this timer is
> >> disabled?   Any ideas are welcome!
> 
> The NMI counter gets driven by the APIC timer. Perhaps it needs poking
> somehow on wakeup from C3? My suggestion for debugging this would be to take
> a look at what native Linux does. The NMI perfctr poking logic was all taken
> from (rather old now) upstream Linux.
> 
>  -- Keir
> 
>> Thanks
>> Roger R. Cruz
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Roger Cruz
>> Sent: Tuesday, September 14, 2010 11:55 AM
>> To: Dan Magenheimer; Tim Deegan
>> Cc: xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> Hi Dan,
>> 
>> I am using 3.4.2 where we have made very minor modifications (some backports,
>> for example).
>> 
>> I have not tried your suggestions.. so I will do that next.. thanks!
>> 
>> R.
>> 
>> -----Original Message-----
>> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>> Sent: Tue 9/14/2010 11:19 AM
>> To: Roger Cruz; Tim Deegan
>> Cc: xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> A couple of thoughts:
>> 
>> 
>> 
>> Have you tried max_cstate=0 (as a Xen boot option)?
>> 
>> 
>> 
> >> Also, you didn't say what version of Xen you are using but playing around
> >> with hpet_broadcast (enabling it or force-disabling it as below) might be
> >> worth a try.
>> 
>> 
>> 
>> http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html
>> 
>> 
>> 
>> From: Roger Cruz [mailto:roger.cruz@virtualcomputer.com]
>> Sent: Tuesday, September 14, 2010 8:56 AM
>> To: Tim Deegan
>> Cc: xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] State of current Xen debugger
>> 
>> 
>> 
>> Hi Tim,  good to hear from you again
>> 
> >> I had a pretty good inkling that one of you hardcore developers would say
> >> that :-)  Yes, it is pretty well wedged.  I can cause the problem more
> >> rapidly by dropping to a single CPU.  When the hang happens, the Xen
> >> console is completely dead.  None of the special keys work.
>> 
> >> I do have hopes a BIOS upgrade could fix this as a last resort but I want to
> >> see if at least I can understand the problem.  We have a few different
> >> machines that are exhibiting similar symptoms so I have to see if I can find
> >> a work-around without requiring every user to upgrade their BIOS :-(
>> 
>> Just in case, what debugger have you been using?  Are there recent
>> instructions on how to set it up that you can point me to?
>> 
>> Thanks
>> Roger
>> 
>> 
>> -----Original Message-----
>> From: Tim Deegan [mailto:Tim.Deegan@citrix.com]
>> Sent: Tue 9/14/2010 10:30 AM
>> To: Roger Cruz
>> Cc: xen-devel@lists.xensource.com
>> Subject: Re: [Xen-devel] State of current Xen debugger
>> 
>> Hi,
>> 
>> At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
>>> I am trying to debug a problem where the hypervisor is hanging hard.
>>> Not even the NMI watchdog is triggering a reboot.  So I wanted to hook
>>> up a debugger.
>> 
>> Sorry to bring a counsel of despair but if the NMI watchdog isn't
>> working then your chances of getting a working debugger are slim.  It's
>> likely that at least one CPU is very very stuck.  Does the 'd' debug key
>> work on the serial line when the machine is wedged?
>> 
>> On a more cheerful note, I've twice seen hard hangs like this that
>> turned out to be hardware issues, fixable with BIOS upgrades.
>> 
>> Cheers,
>> 
>> Tim.
>> 
>>> What is the state of the current debuggers out there?
>>> Any input on how I should set it up (kdb, gdb, etc) and pointers to a
>>> good wiki page are much appreciated.  I did perform a Google search
>>> and found some links but I want to hear from the current developers as
>>> to what is most stable and useful for debugging this type of hard
>>> hang.  I only have a serial port PCI-express card to use as the laptop
>>> has no built in port.
>> 
>> --
>> Tim Deegan <Tim.Deegan@citrix.com>
>> Principal Software Engineer, XenServer Engineering
>> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
>> 


end of thread, other threads:[~2010-09-28 17:14 UTC | newest]

Thread overview: 13+ messages
2010-09-14 14:22 State of current Xen debugger Roger Cruz
2010-09-14 14:30 ` Tim Deegan
2010-09-14 14:56   ` Roger Cruz
2010-09-14 15:19     ` Dan Magenheimer
2010-09-14 15:54       ` Roger Cruz
2010-09-15 14:19         ` Roger Cruz
2010-09-28 15:21         ` Roger Cruz
2010-09-28 15:30           ` Keir Fraser
2010-09-28 15:40             ` Roger Cruz
2010-09-28 17:06               ` Keir Fraser
2010-09-28 17:14                 ` Roger Cruz
2010-09-14 15:20     ` Tim Deegan
2010-09-14 16:08       ` Roger Cruz
