* Known console(d) bug?
From: Ferenc Wagner @ 2009-05-29 18:26 UTC
  To: xen-devel

Hi,

I've been struggling with a problem in our Xen hosting environment for
quite some time.  Basically, after a couple of months of smooth
running, most virtual machines suddenly get stuck in r state and stop
responding to anything, including xm console and xm sysrq.  It happens
rather regularly, but I can't reproduce it by taxing the domUs or the
dom0 with disk I/O, CPU or console I/O.

However, a couple of days ago it turned out that this situation can be
cured by restarting xenconsoled!  After that, xm console spit out the
previous random typing, sysrq help strings and whatnot for the domUs
which weren't stuck in r state, and the stuck ones also started to
respond and run normally (spending most of their time in b state) again.

The whole phenomenon looked like xenconsoled stopped emptying the domU
console buffers, and those domUs which were constantly writing to
their consoles quickly filled them up and started busy-looping trying
to put more characters onto their consoles, not even caring to respond
to ping.  But those domUs which didn't write to their consoles stayed
functional until the desperate operator forced them to create enough
console output to fill up their buffers as well, and then they got
stuck in r state just like the others.  After restarting xenconsoled
all were able to recover successfully.
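
The hypothesis above can be illustrated with a toy model.  This is only
a sketch under stated assumptions: ConsoleRing, guest_write and
daemon_drain are hypothetical names, the ring is a plain fixed-size
queue rather than Xen's actual shared-page layout, and max_spins exists
only so that the demo terminates (a real guest would keep retrying).

```python
from collections import deque


class ConsoleRing:
    """Toy fixed-size ring shared by a guest (producer) and a daemon (consumer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def guest_write(self, data, max_spins=3):
        """Queue as much of data as fits; return (bytes_written, spins_wasted)."""
        written = spins = 0
        while written < len(data) and spins < max_spins:
            free = self.capacity - len(self.buf)
            if free == 0:
                spins += 1  # ring full: the guest can only retry
                continue
            chunk = data[written:written + free]
            self.buf.extend(chunk)
            written += len(chunk)
        return written, spins

    def daemon_drain(self):
        """Empty the ring, as a healthy console daemon would."""
        out = bytes(self.buf)
        self.buf.clear()
        return out


ring = ConsoleRing(capacity=8)
print(ring.guest_write(b"hello"))   # fits, no spinning
print(ring.guest_write(b"world!"))  # fills the ring, then spins
print(ring.daemon_drain())          # the daemon catches up
print(ring.guest_write(b"ld!"))     # the guest makes progress again
```

In this model a guest that never writes is unaffected by a stalled
daemon, while a chatty guest burns CPU as soon as the ring fills, which
matches the behaviour described above.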

Of course the above is just guessing; I don't know the details of Xen
console handling.  But I wonder if it rings any bells here, or maybe
this issue is known and fixed already.  Oh, I experience this under
Xen 3.2 with pv-ops guests (2.6.26+patches).
-- 
Thanks,
Feri.


* Re: Known console(d) bug?
From: Pasi Kärkkäinen @ 2009-05-29 21:53 UTC
  To: Ferenc Wagner; +Cc: xen-devel

On Fri, May 29, 2009 at 08:26:33PM +0200, Ferenc Wagner wrote:
> Hi,
> 
> There's a problem I'm struggling with for quite some time in our Xen
> hosting environment.  Basically, after a couple of months' smooth
> running time, suddenly most virtual machines get stuck into r state
> and stop responding to anything, including xm console and xm sysrq.
> It happens rather regularly, but I can't reproduce it by taxing the
> domUs or the dom0 with disk I/O, CPU or console I/O.
> 
> However, a couple of days ago it turned out that this situation can be
> cured by restarting xenconsoled!  After that, xm console spit out the
> previous random typing, sysrq help strings and whatnot for the domUs
> which weren't stuck in r state, and the stuck ones also started to
> respond and run normally (spending most of their time in b state) again.
> 
> The whole phenomenon looked like xenconsoled stopped emptying the domU
> console buffers, and those domUs which were constantly writing to
> their consoles quickly filled it up and started busy-looping trying to
> put more characters onto their consoles, not caring to respond to
> ping, even.  But those domUs which didn't write to their consoles,
> stayed functional until the desperate operator forced them to create
> enough console output to fill up their buffers as well, and then they
> stuck into r state just like the others.  After restarting xenconsoled
> all were able to recover successfully.
> 
> Of course the above is just guessing, I don't know the details of Xen
> console handling.  But I wonder if it rings any bells here, or maybe
> this issue is known and fixed already.  Oh, I experience this under
> Xen 3.2 and pv-ops guests (2.6.26+patches).

I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2). 
I believe it's also in 5.3. 

I reported the problem to xen-devel, but I couldn't provide the needed
strace/backtrace to figure out the reason _why_ that happens.. (I had
already restarted xenconsoled..)

I think developers would need more information to figure out what the
actual bug is. 

-- Pasi


* Re: Known console(d) bug?
From: Keir Fraser @ 2009-05-29 23:04 UTC
  To: Pasi Kärkkäinen, Ferenc Wagner; +Cc: xen-devel

On 29/05/2009 22:53, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:

> I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2).
> I believe it's also in 5.3.
> 
> I reported the problem to xen-devel, but I couldn't provide the needed
> strace/backtrace to figure out the reason _why_ that happens.. (I had
> already restarted xenconsoled..)
> 
> I think developers would need more information to figure out what the
> actual bug is. 

Yes, I think any kind of xenconsoled hang can eventually result in guests
spinning waiting for their console buffers to be emptied. It might be
interesting to build xenconsoled with debug symbols (-g compile option) and
attach gdb when it gets in this state. Without that kind of info it'll be
hard to track down.
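
A minimal sketch of how the gdb step might be scripted so that the
backtraces get captured before anyone restarts the daemon.  The helper
name gdb_backtrace_argv and the PID in the example are hypothetical;
the gdb options used (-batch, -p, -ex) are standard ones.  The code
only builds the command and does not run it.

```python
import shlex


def gdb_backtrace_argv(pid):
    """Build a gdb command that attaches to pid, dumps every thread's
    backtrace, and then exits, detaching from the process again."""
    return ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"]


def as_shell_command(argv):
    """Quote an argv list so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Hypothetical PID; on a live dom0 it would come from the pid file or ps.
print(as_shell_command(gdb_backtrace_argv(1234)))
```

The backtrace is only readable if xenconsoled was built with -g as
suggested above, or if separate debug symbols are installed.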

 -- Keir


* Re: Known console(d) bug?
From: Ferenc Wagner @ 2009-05-29 23:06 UTC
  To: xen-devel

Pasi Kärkkäinen <pasik@iki.fi> writes:

> On Fri, May 29, 2009 at 08:26:33PM +0200, Ferenc Wagner wrote:
> 
>> There's a problem I'm struggling with for quite some time in our Xen
>> hosting environment.  Basically, after a couple of months' smooth
>> running time, suddenly most virtual machines get stuck into r state
>> and stop responding to anything, including xm console and xm sysrq.
>> It happens rather regularly, but I can't reproduce it by taxing the
>> domUs or the dom0 with disk I/O, CPU or console I/O.
>> 
>> However, a couple of days ago it turned out that this situation can be
>> cured by restarting xenconsoled!  After that, xm console spit out the
>> previous random typing, sysrq help strings and whatnot for the domUs
>> which weren't stuck in r state, and the stuck ones also started to
>> respond and run normally (spending most of their time in b state) again.
>> 
>> The whole phenomenon looked like xenconsoled stopped emptying the domU
>> console buffers, and those domUs which were constantly writing to
>> their consoles quickly filled it up and started busy-looping trying to
>> put more characters onto their consoles, not caring to respond to
>> ping, even.  But those domUs which didn't write to their consoles,
>> stayed functional until the desperate operator forced them to create
>> enough console output to fill up their buffers as well, and then they
>> stuck into r state just like the others.  After restarting xenconsoled
>> all were able to recover successfully.
>> 
>> Of course the above is just guessing, I don't know the details of Xen
>> console handling.  But I wonder if it rings any bells here, or maybe
>> this issue is known and fixed already.  Oh, I experience this under
>> Xen 3.2 and pv-ops guests (2.6.26+patches).
>
> I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2). 
> I believe it's also in 5.3. 
>
> I reported the problem to xen-devel, but I couldn't provide the needed
> strace/backtrace to figure out the reason _why_ that happens.. (I had
> already restarted xenconsoled..)
>
> I think developers would need more information to figure out what the
> actual bug is. 

Indeed, I have now found your report.  This means you've been running
for almost a year without experiencing this!  I get it much more often,
but still pretty rarely.  I also noticed that the more or less regular

WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 200 ms (> 50 ms) (GSource: 0x811bf80)

messages from heartbeat came 50 times more often while xenstored was
stuck (at least it didn't take any significant CPU).  However, four
domUs constantly in r state surely sucked up all the CPU power of the
4-way host machine.

And this phenomenon is always triggered by some extra load, typically
by tiger starting an md5sum check of the installed packages at the
same time on a couple of domUs.  (Btw, isn't there some randomized
crond that helps with this in general?)
-- 
Cheers,
Feri.


* Re: Known console(d) bug?
From: Ferenc Wagner @ 2009-07-17 21:35 UTC
  To: xen-devel

Keir Fraser <keir.fraser@eu.citrix.com> writes:

> On 29/05/2009 22:53, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
>
>> I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2).
>> I believe it's also in 5.3.
>> 
>> I reported the problem to xen-devel, but I couldn't provide the needed
>> strace/backtrace to figure out the reason _why_ that happens.. (I had
>> already restarted xenconsoled..)
>> 
>> I think developers would need more information to figure out what the
>> actual bug is. 
>
> Yes, I think any kind of xenconsoled hang can eventually result in guests
> spinning waiting for their console buffers to be emptied. It might be
> interesting to build xenconsoled with debug symbols (-g compile option) and
> attach gdb when it gets in this state. Without that kind of info it'll be
> hard to track down.

I haven't had the opportunity to run xenconsoled with debugging
enabled yet, but disaster struck again while I was on holiday.  My
co-workers restarted some stuck domains, but left a couple around.
Attaching strace to xenconsoled showed a pretty large timeout on select:

select(43, [6 8 9 11 12 14 15 18 20 21 24 26 27 29 30 32 33 35 36 38 39 41 42], [9 12 21 24], NULL, {4144869, 572000} <unfinished ...>

which may or may not be a clue.  The lsof output seemed reasonable:

COMMAND    PID USER   FD   TYPE     DEVICE    SIZE       NODE NAME
xenconsol 4566 root  cwd    DIR      253,4    4096        128 /
xenconsol 4566 root  rtd    DIR      253,4    4096        128 /
xenconsol 4566 root  txt    REG      253,2   21296     577488 /usr/lib/xen-3.2-1/bin/xenconsoled
xenconsol 4566 root  mem    REG        0,3         2147483647 /proc/xen/privcmd (path inode=4026533301)
xenconsol 4566 root  mem    REG      253,4  116414    3175190 /lib/i686/cmov/libpthread-2.7.so
xenconsol 4566 root  mem    REG      253,4 1413540    3170117 /lib/i686/cmov/libc-2.7.so
xenconsol 4566 root  mem    REG      253,2   15300    2621918 /usr/lib/libxenstore.so.3.0.0
xenconsol 4566 root  mem    REG      253,2   71684    3217152 /usr/lib/xen-3.2-1/lib/libxenctrl.so
xenconsol 4566 root  mem    REG      253,4    9684    3175197 /lib/i686/cmov/libutil-2.7.so
xenconsol 4566 root  mem    REG      253,4  113248    1050535 /lib/ld-2.7.so
xenconsol 4566 root    0u   CHR        1,3                936 /dev/null
xenconsol 4566 root    1u   CHR        1,3                936 /dev/null
xenconsol 4566 root    2u   CHR        1,3                936 /dev/null
xenconsol 4566 root    3uW  REG      253,3       5    1573306 /var/run/xenconsoled.pid
xenconsol 4566 root    4u  unix 0xcfb47180              10030 socket
xenconsol 4566 root    5u   REG        0,3       0 4026533301 /proc/xen/privcmd
xenconsol 4566 root    6r  FIFO        0,6              10032 pipe
xenconsol 4566 root    7w  FIFO        0,6              10032 pipe
xenconsol 4566 root    8u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root    9u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   10u   CHR      136,1                  3 /dev/pts/1
xenconsol 4566 root   11u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   12u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   13u   CHR      136,2                  4 /dev/pts/2
xenconsol 4566 root   14u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   15u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   16u   CHR      136,3                  5 /dev/pts/3
xenconsol 4566 root   17u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   18u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   19u   CHR      136,4                  6 /dev/pts/4
xenconsol 4566 root   20u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   21u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   22u   CHR      136,5                  7 /dev/pts/5
xenconsol 4566 root   23u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   24u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   25u   CHR      136,6                  8 /dev/pts/6
xenconsol 4566 root   26u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   27u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   28u   CHR      136,7                  9 /dev/pts/7
xenconsol 4566 root   29u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   30u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   31u   CHR      136,8                 10 /dev/pts/8
xenconsol 4566 root   32u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   33u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   34u   CHR      136,9                 11 /dev/pts/9
xenconsol 4566 root   35u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   36u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   37u   CHR     136,10                 12 /dev/pts/10
xenconsol 4566 root   38u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   39u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   40u   CHR     136,11                 13 /dev/pts/11
xenconsol 4566 root   41u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root   42u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root   43u   CHR     136,12                 14 /dev/pts/12
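
One way to cross-check the two outputs is to compare the descriptor
sets passed to select() with the descriptors lsof lists.  The sketch
below does only that.  It assumes, from the ordering of the listing,
that from fd 8 upward the descriptors repeat as evtchn, ptmx, pts
triples and that each triple belongs to one domain.  The helper names
are hypothetical.

```python
import re

# The select() call as captured by strace in this message.
STRACE_LINE = (
    "select(43, [6 8 9 11 12 14 15 18 20 21 24 26 27 29 30 32 33 35 36 38 "
    "39 41 42], [9 12 21 24], NULL, {4144869, 572000} <unfinished ...>"
)

# Descriptor numbers read off the lsof listing: from fd 8 upward the
# entries repeat as /dev/xen/evtchn, /dev/ptmx, /dev/pts/N.
EVTCHN_FDS = set(range(8, 42, 3))   # 8, 11, ..., 41
PTMX_FDS = set(range(9, 43, 3))     # 9, 12, ..., 42


def parse_fd_sets(line):
    """Return (readfds, writefds) from an strace-formatted select() line."""
    m = re.match(r"select\(\d+, \[([\d ]*)\], \[([\d ]*)\]", line)
    if m is None:
        raise ValueError("not a select() line in the expected format")
    return tuple({int(fd) for fd in group.split()} for group in m.groups())


def pty_for(evtchn_fd):
    """Assumed pairing: the /dev/pts entry listed two descriptors later."""
    return "/dev/pts/%d" % ((evtchn_fd - 8) // 3 + 1)


readfds, writefds = parse_fd_sets(STRACE_LINE)
unwatched_evtchn = sorted(EVTCHN_FDS - readfds)
unwatched_ptmx = sorted(PTMX_FDS - readfds)

print("evtchn fds not in the read set:", unwatched_evtchn)
print("their assumed ptys:", [pty_for(fd) for fd in unwatched_evtchn])
print("ptmx fds not in the read set:", unwatched_ptmx)
print("write set is all ptmx:", writefds <= PTMX_FDS)
```

With the data above, descriptors 17 and 23 are the only event-channel
descriptors missing from the read set, while every ptmx descriptor is
present and the write set consists of ptmx descriptors only.  Whether
that is significant depends on how xenconsoled builds these sets, which
is not examined here.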

After restarting xenconsoled, the stuck domain said:

[1052088.070488] BUG: soft lockup - CPU#0 stuck for 136469s! [nscd:1796]

pretty much as expected.  I still plan to investigate this, but I'm
sending this now just in case it rings a bell somewhere...
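
For scale, the second counts quoted above (the select timeout and the
soft lockup duration) can be converted to days and hours.  A small
sketch; the helper name describe is hypothetical and the inputs are
copied from the strace and kernel output in this message.

```python
from datetime import timedelta


def describe(seconds):
    """Format a duration in seconds as days, hours and minutes."""
    td = timedelta(seconds=seconds)
    hours, rest = divmod(td.seconds, 3600)
    return "%dd %dh %dm" % (td.days, hours, rest // 60)


SELECT_TIMEOUT_S = 4144869   # tv_sec of the select() timeout
SOFT_LOCKUP_S = 136469       # "stuck for 136469s" from the guest kernel

print("select timeout:", describe(SELECT_TIMEOUT_S))
print("soft lockup:   ", describe(SOFT_LOCKUP_S))
```

So the daemon was prepared to sleep for almost 48 days if no descriptor
became ready, and the guest had been spinning for a bit over a day and
a half.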
-- 
Regards,
Feri.
