* [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
@ 2015-01-08 13:36 Gerhard Wiesinger
  2015-01-08 17:24 ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 13:36 UTC (permalink / raw)
  To: qemu-devel

Hello,

After upgrading my KVM environment from Fedora 20 to Fedora 21 (hosts and 
guests fully up to date, Intel CPU) I have the following problem:
1.) On the database VM, e.g. 2 PostgreSQL processes hang with 100% CPU
2.) On the monitoring VM, Munin/RRDtool also hangs with 100% CPU
Killing the processes is not possible; only a reboot helps.

But I nailed it down to the following:
yum install strace
strace -y -p339
Process 339 attached
# no system calls here
^CProcess 339 detached
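
For reference: on x86_64, gettimeofday() is normally answered from the vDSO 
without entering the kernel, which is why strace shows no system calls while 
the process spins. A minimal sketch (assuming glibc and SYS_gettimeofday) to 
compare the vDSO path with a forced syscall:

#include <stdio.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
  struct timeval tv_vdso, tv_sys;

  /* Normally served by the vDSO: no kernel entry, invisible to strace. */
  gettimeofday(&tv_vdso, NULL);

  /* Forces a real system call, always visible to strace. */
  syscall(SYS_gettimeofday, &tv_sys, NULL);

  printf("vdso: %ld, syscall: %ld\n",
         (long)tv_vdso.tv_sec, (long)tv_sys.tv_sec);
  return 0;
}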

Going deeper:
yum --enablerepo fedora-debuginfo,updates-debuginfo install gdb 
postgresql-debuginfo
gdb postgres 339
(gdb) bt
#0  0x00007fffffbf8ff8 in gettimeofday ()
#1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
(gdb) frame 1
#1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
1274            gettimeofday(&tp, NULL);
(same on second process here)

Quitting and reattaching gdb also hangs here, so gettimeofday takes 100% 
CPU and never ends!

Therefore I guess this is a problem either in the Linux kernel or in 
QEMU/KVM.
It might be that something changed in timer handling, or that some default 
changed here. The VMs are time-synchronized with NTP, so that might also 
play a role in the syscall hanging here.

Workaround when it happens: reboot ....

Any further ideas?

Thank you.

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 13:36 [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever Gerhard Wiesinger
@ 2015-01-08 17:24 ` Paolo Bonzini
  2015-01-08 18:12   ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 17:24 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel



On 08/01/2015 14:36, Gerhard Wiesinger wrote:
> Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
> CPU and never ends!
> 
> Therefore I guess this is a problem either in the Linux kernel or in
> QEMU/KVM.

What kernel are you running on (and were you running on)?

Can you try F20 host and F21 guest or vice versa?

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 17:24 ` Paolo Bonzini
@ 2015-01-08 18:12   ` Gerhard Wiesinger
  2015-01-08 18:22     ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 18:12 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 08.01.2015 18:24, Paolo Bonzini wrote:
>
> On 08/01/2015 14:36, Gerhard Wiesinger wrote:
>> Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
>> CPU and never ends!
>>
>> Therefore I guess this is a problem either in the Linux kernel or in
>> QEMU/KVM.
> What kernel are you running on (and were you running on)?
>
> Can you try F20 host and F21 guest or vice versa?

Hello Paolo,

Always latest available stable versions (kernel/qemu-kvm):
F20: 3.17.6-200.fc20.x86_64 on guest/host, qemu-kvm-1.6.2-10.fc20.x86_64 
on host
F21: 3.17.7-300.fc21.x86_64 on guest/host, qemu-kvm-2.1.2-7.fc21.x86_64 
on host
(I also had 3.17.6-300.fc21.x86_64 and it happened there, too).

The issue is: it happens only after some time (e.g. hours to days).
It is a production environment running around 10 VMs and I want to avoid 
many experiments there. Since the kernels were the same on FC20/F21 and 
qemu/kvm changed from 1.6.2 to 2.1.2, I guess the problem lies there. A 
newer gcc might also be a factor.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?id=refs/tags/v3.17.7#n695
I guess it hangs in the do/while loop here:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?id=refs/tags/v3.17.7#n493

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 18:12   ` Gerhard Wiesinger
@ 2015-01-08 18:22     ` Paolo Bonzini
  2015-01-08 22:28       ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 18:22 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel



On 08/01/2015 19:12, Gerhard Wiesinger wrote:
> Since the kernels were the same on FC20/F21 and qemu/kvm changed from
> 1.6.2 to 2.1.2, I guess the problem lies there. A newer gcc might also
> be a factor.

Indeed.  Can you try the 2.2.0 qemu-kvm release, available in the
fedora-virt-preview repository?

http://fedoraproject.org/wiki/Virtualization_Preview_Repository

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 18:22     ` Paolo Bonzini
@ 2015-01-08 22:28       ` Gerhard Wiesinger
  2015-01-08 22:42         ` Paolo Bonzini
  2015-01-12 11:41         ` Gerhard Wiesinger
  0 siblings, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 22:28 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 08.01.2015 19:22, Paolo Bonzini wrote:
>
> On 08/01/2015 19:12, Gerhard Wiesinger wrote:
>> Since the kernels were the same on FC20/F21 and qemu/kvm changed from
>> 1.6.2 to 2.1.2, I guess the problem lies there. A newer gcc might also
>> be a factor.
> Indeed.  Can you try the 2.2.0 qemu-kvm release, available in the
> fedora-virt-preview repository?
>
> http://fedoraproject.org/wiki/Virtualization_Preview_Repository

Updated to the 2.2.0 qemu-kvm release; worked seamlessly so far for all VMs.

I'll keep you up to date in the next days whether it happens again or not.

BTW: Has something changed in the time code area between 1.6.2 and 2.1.2?

Thank you so far.

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 22:28       ` Gerhard Wiesinger
@ 2015-01-08 22:42         ` Paolo Bonzini
  2015-01-12 11:41         ` Gerhard Wiesinger
  1 sibling, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 22:42 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel



On 08/01/2015 23:28, Gerhard Wiesinger wrote:
>>
> 
> Updated to the 2.2.0 qemu-kvm release; worked seamlessly so far for all VMs.
> 
> I'll keep you up to date in the next days whether it happens again or not.
> 
> BTW: Has something changed in the time code area between 1.6.2 and 2.1.2?

Most of this is actually in the kernel, but sometimes even
reading/writing the status out of the kernel can trigger bugs.

2.2 has more robust live migration of the kvm pv clock, but fixing that
also uncovered a few other bugs and it's possible they were latent in
2.1.  Crossing fingers.
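
For context: the guest reads kvmclock from a pvclock_vcpu_time_info record 
shared with the host, guarded by a version field that is odd while the host 
updates it, so a record left inconsistent (e.g. across a migration) can make 
the reader retry forever. A simplified sketch of the shape of that read; the 
real kernel code uses a wider multiply and proper barriers, this only shows 
the retry loop:

#include <stdint.h>

/* Simplified layout of the shared kvmclock record (pvclock ABI). */
struct pvclock_vcpu_time_info {
  uint32_t version;           /* odd while the host is updating the record */
  uint32_t pad0;
  uint64_t tsc_timestamp;
  uint64_t system_time;
  uint32_t tsc_to_system_mul;
  int8_t   tsc_shift;
  uint8_t  flags;
  uint8_t  pad[2];
};

/* Sketch of the guest-side read: retry until a consistent snapshot is seen. */
static uint64_t pvclock_read_ns(volatile struct pvclock_vcpu_time_info *ti,
                                uint64_t tsc)
{
  uint32_t version;
  uint64_t delta, ns;

  do {
    version = ti->version;
    __sync_synchronize();
    delta = tsc - ti->tsc_timestamp;
    if (ti->tsc_shift >= 0)
      delta <<= ti->tsc_shift;
    else
      delta >>= -ti->tsc_shift;
    ns = ti->system_time + ((delta * ti->tsc_to_system_mul) >> 32);
    __sync_synchronize();
  } while ((version & 1) || version != ti->version);

  return ns;
}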

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 22:28       ` Gerhard Wiesinger
  2015-01-08 22:42         ` Paolo Bonzini
@ 2015-01-12 11:41         ` Gerhard Wiesinger
  2015-01-12 11:46           ` Paolo Bonzini
  2015-01-13 20:13           ` Gerhard Wiesinger
  1 sibling, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-12 11:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 08.01.2015 23:28, Gerhard Wiesinger wrote:
> On 08.01.2015 19:22, Paolo Bonzini wrote:
>>
>> Indeed.  Can you try the 2.2.0 qemu-kvm release, available in the
>> fedora-virt-preview repository?
>>
>> http://fedoraproject.org/wiki/Virtualization_Preview_Repository
>
> Updated to the 2.2.0 qemu-kvm release; worked seamlessly so far for all VMs.
>
> I'll keep you up to date in the next days whether it happens again or 
> not.

With qemu-kvm 2.2.0 release from the above repository the 100% usage 
didn't happen so far (although I had to reboot after kernel update).

So it looks like qemu-kvm 2.1.x has major bugs regarding timer handling. 
Is any backporting planned?

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:41         ` Gerhard Wiesinger
@ 2015-01-12 11:46           ` Paolo Bonzini
  2015-01-12 11:53             ` Gerhard Wiesinger
  2015-01-13 20:13           ` Gerhard Wiesinger
  1 sibling, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-12 11:46 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel



On 12/01/2015 12:41, Gerhard Wiesinger wrote:
>>
>> Updated to the 2.2.0 qemu-kvm release; worked seamlessly so far for all VMs.
>>
>> I'll keep you up to date in the next days whether it happens again or
>> not.
> 
> With qemu-kvm 2.2.0 release from the above repository the 100% usage
> didn't happen so far (although I had to reboot after kernel update).
> 
> So it looks like qemu-kvm 2.1.x has major bugs regarding timer handling.
> Is any backporting planned?

That's difficult without bisection pointing out where the bugs were
fixed.  2.1.3 is scheduled real soon now and it will be the last release
from the 2.1.x branch.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:46           ` Paolo Bonzini
@ 2015-01-12 11:53             ` Gerhard Wiesinger
  2015-01-12 11:54               ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-12 11:53 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 12.01.2015 12:46, Paolo Bonzini wrote:
>
> On 12/01/2015 12:41, Gerhard Wiesinger wrote:
>>> Updated to the 2.2.0 qemu-kvm release; worked seamlessly so far for all VMs.
>>>
>>> I'll keep you up to date in the next days whether it happens again or
>>> not.
>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>> didn't happen so far (although I had to reboot after kernel update).
>>
>> So it looks like qemu-kvm 2.1.x has major bugs regarding timer handling.
>> Is any backporting planned?
> That's difficult without bisection pointing out where the bugs were
> fixed.  2.1.3 is scheduled real soon now and it will be the last release
> from the 2.1.x branch.

So no security updates planned for 2.1.x afterwards?

For Fedora: is an update to 2.2.x planned for the regular update repo?
Will updates for 2.2.x also be released in the preview repos?
http://fedoraproject.org/wiki/Virtualization_Preview_Repository

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:53             ` Gerhard Wiesinger
@ 2015-01-12 11:54               ` Paolo Bonzini
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-12 11:54 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel



On 12/01/2015 12:53, Gerhard Wiesinger wrote:
>>>
>> That's difficult without bisection pointing out where the bugs were
>> fixed.  2.1.3 is scheduled real soon now and it will be the last release
>> from the 2.1.x branch.
> 
> So no security updates planned for 2.1.x afterwards?

Distros usually take care of them, except for really egregious
vulnerabilities.

Paolo

> For Fedora: is an update to 2.2.x planned for the regular update repo?
> Will updates for 2.2.x also be released in the preview repos?
> http://fedoraproject.org/wiki/Virtualization_Preview_Repository

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:41         ` Gerhard Wiesinger
  2015-01-12 11:46           ` Paolo Bonzini
@ 2015-01-13 20:13           ` Gerhard Wiesinger
  2015-01-13 20:48             ` Paolo Bonzini
  2015-01-14 17:47             ` Juan Quintela
  1 sibling, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 20:13 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 12.01.2015 12:41, Gerhard Wiesinger wrote:
> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>
>> I'll keep you up to date in the next days whether it happens again or 
>> not.
>
> With qemu-kvm 2.2.0 release from the above repository the 100% usage 
> didn't happen so far (although I had to reboot after kernel update).

It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL 
is running:
(gdb) bt
#0  0x00007fff9a1feff4 in gettimeofday ()
#1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274

What we know:
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host, 
qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host, 
qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host, 
qemu-kvm-2.2.0-1.fc21.x86_64 on host

No one else can reproduce this or has similar problems?
Any further ideas?

BTW: I'm running ntp in the following manner: internet <=> ntp server in 
VM <=> ntp client on KVM host (firewall runs in KVM)

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 20:13           ` Gerhard Wiesinger
@ 2015-01-13 20:48             ` Paolo Bonzini
  2015-01-13 21:08               ` Cole Robinson
  2015-01-13 21:14               ` Gerhard Wiesinger
  2015-01-14 17:47             ` Juan Quintela
  1 sibling, 2 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-13 20:48 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel, Cole Robinson, virt



On 13/01/2015 21:13, Gerhard Wiesinger wrote:
> 
> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
> is running:
> (gdb) bt
> #0  0x00007fff9a1feff4 in gettimeofday ()
> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
> 
> What we know:
> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
> qemu-kvm-1.6.2-10.fc20.x86_64 on host
> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.1.2-7.fc21.x86_64 on host
> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.2.0-1.fc21.x86_64 on host
> 
> No one else can reproduce this or has similar problems?
> Any further ideas?

Hmm, too bad. :(

Any chance you can try the 1.7 and 2.0 releases?  Cole, perhaps you
could help build some RPMs for Gerhard?

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 20:48             ` Paolo Bonzini
@ 2015-01-13 21:08               ` Cole Robinson
  2015-01-13 21:14               ` Gerhard Wiesinger
  1 sibling, 0 replies; 35+ messages in thread
From: Cole Robinson @ 2015-01-13 21:08 UTC (permalink / raw)
  To: Paolo Bonzini, Gerhard Wiesinger, qemu-devel, virt

On 01/13/2015 03:48 PM, Paolo Bonzini wrote:
> 
> 
> On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>>
>> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
>> is running:
>> (gdb) bt
>> #0  0x00007fff9a1feff4 in gettimeofday ()
>> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one else can reproduce this or has similar problems?
>> Any further ideas?
> 
> Hmm, too bad. :(
> 
> Any chance you can try the 1.7 and 2.0 releases?  Cole, perhaps you
> could help build some RPMs for Gerhard?
> 
> Paolo
> 

There are F21 qemu-2.0 builds here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=538383
And 1.7 builds here: http://koji.fedoraproject.org/koji/buildinfo?buildID=499026

There might be dependency issues trying to install them on the latest F21 if
they were built against older libraries. Worth a shot though. Beyond that,
something custom will need to be built, though at that point it's probably
easier to build qemu from source.

- Cole

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 20:48             ` Paolo Bonzini
  2015-01-13 21:08               ` Cole Robinson
@ 2015-01-13 21:14               ` Gerhard Wiesinger
  2015-01-13 21:16                 ` Paolo Bonzini
  1 sibling, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 21:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 13.01.2015 21:48, Paolo Bonzini wrote:
>
> On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
>> is running:
>> (gdb) bt
>> #0  0x00007fff9a1feff4 in gettimeofday ()
>> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one else can reproduce this or has similar problems?
>> Any further ideas?
> Hmm, too bad. :(
>
> Any chance you can try the 1.7 and 2.0 releases?  Cole, perhaps you
> could help build some RPMs for Gerhard?
>

Yes, I can try. RPMs would be fine.

I don't know if it is related, but on another VM I was getting the 
following 75s after reboot:
[   78.857006] INFO: rcu_sched self-detected stall on CPU { 0} (t=60000 
jiffies g=1966 c=1965 q=0)
[  258.860006] INFO: rcu_sched self-detected stall on CPU { 0} (t=240003 
jiffies g=1966 c=1965 q=0)

I also had a look at the kernel code again:
http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
499         do {
500                 seq = read_seqcount_begin(&tk_core.seq);
501
502                 ts->tv_sec = tk->xtime_sec;
503                 nsecs = timekeeping_get_ns(&tk->tkr);
504
505         } while (read_seqcount_retry(&tk_core.seq, seq));

So it looks like the seqcount always changes and therefore it loops 
forever here (as far as I dug down, this is the only loop here).

Might be something wrong with the memory barriers in recent qemu-kvm 
releases?
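
To illustrate the failure mode: a seqcount reader only completes when it sees 
the same even sequence value before and after its reads, so a counter that is 
stuck at an odd value or keeps being bumped makes the reader spin at 100% CPU. 
A minimal model of the pattern (hypothetical names, not the kernel code):

#include <stdint.h>

struct seq_clock {
  volatile uint32_t seq;    /* odd = writer in progress */
  volatile uint64_t sec;
  volatile uint64_t nsec;
};

static void seq_read(const struct seq_clock *c, uint64_t *sec, uint64_t *nsec)
{
  uint32_t start;

  do {
    /* Wait until no writer is in progress (even sequence value). */
    do {
      start = c->seq;
    } while (start & 1);

    *sec  = c->sec;
    *nsec = c->nsec;

    /* Retry if a writer ran in the meantime.  If seq is corrupted so that
     * it stays odd or keeps changing, this loop never terminates: exactly
     * the 100% CPU spin seen in gettimeofday(). */
  } while (c->seq != start);
}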

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 21:14               ` Gerhard Wiesinger
@ 2015-01-13 21:16                 ` Paolo Bonzini
  2015-01-13 22:01                   ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-13 21:16 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel, Cole Robinson, virt



On 13/01/2015 22:14, Gerhard Wiesinger wrote:
> 
> I also had a look at the kernel code again:
> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
> 499         do {
> 500                 seq = read_seqcount_begin(&tk_core.seq);
> 501
> 502                 ts->tv_sec = tk->xtime_sec;
> 503                 nsecs = timekeeping_get_ns(&tk->tkr);
> 504
> 505         } while (read_seqcount_retry(&tk_core.seq, seq));
> 
> So it looks like the seqcount always changes and therefore it loops
> forever here (as far as I dug down, this is the only loop here).
> 
> Might be something wrong with the memory barriers in recent qemu-kvm
> releases?

No, that's not possible.  Unless you pause/resume or migrate the VM, all
of the handling of kvmclock is entirely in the kernel.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 21:16                 ` Paolo Bonzini
@ 2015-01-13 22:01                   ` Gerhard Wiesinger
  2015-01-14  0:59                     ` Laine Stump
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 22:01 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 13.01.2015 22:16, Paolo Bonzini wrote:
>
> On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>> I also had a look at the kernel code again:
>> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
>> 499         do {
>> 500                 seq = read_seqcount_begin(&tk_core.seq);
>> 501
>> 502                 ts->tv_sec = tk->xtime_sec;
>> 503                 nsecs = timekeeping_get_ns(&tk->tkr);
>> 504
>> 505         } while (read_seqcount_retry(&tk_core.seq, seq));
>>
>> So it looks like the seqcount always changes and therefore it loops
>> forever here (as far as I dug down, this is the only loop here).
>>
>> Might be something wrong with the memory barriers in recent qemu-kvm
>> releases?
> No, that's not possible.  Unless you pause/resume or migrate the VM, all
> of the handling of kvmclock is entirely in the kernel.

Any other possible explanation of the problem?

Had a look at the diff (I guess this is the right file, at least in the qemu tree):
# no critical changes IMHO here
git diff -u v1.6.2..v2.1.2 ./hw/i386/kvm/clock.c

Trying to reproduce with a loop:
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
   struct timeval tv;
   int i = 0;
   for (;;)
   {
     gettimeofday(&tv, 0);
     ++i;
     if (i >= 10000000)
     {
       i = 0;
       printf("%i\n", (int)tv.tv_sec);
     }
   }
   return 0;
}

As I was writing "First tests seem to run well, so no quick win ....", I 
could reproduce it: a stall after 318s :-)
(gdb) bt
#0  0x00007fff6d9fefff in gettimeofday ()
#1  0x00000000004005ad in main (argc=1, argv=0x7fff6d9b28b8) at 
gettimeofdayloop.c:10

So we at least have a testcase which is quick to reproduce.

So we are digging into my second finding of a major bug in qemu-kvm :-)

Can someone try, too?
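
A variant of the loop that exercises clock_gettime() could also help narrow 
things down, since CLOCK_REALTIME and CLOCK_MONOTONIC are normally served by 
the vDSO as well (a sketch along the lines of the program above):

#include <stdio.h>
#include <time.h>

int main(int argc, char* argv[])
{
  struct timespec ts;
  int i = 0;
  for (;;)
  {
    /* Both clocks normally go through the vDSO in the guest. */
    clock_gettime(CLOCK_REALTIME, &ts);
    clock_gettime(CLOCK_MONOTONIC, &ts);
    ++i;
    if (i >= 10000000)
    {
      i = 0;
      printf("%i\n", (int)ts.tv_sec);
    }
  }
  return 0;
}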

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 22:01                   ` Gerhard Wiesinger
@ 2015-01-14  0:59                     ` Laine Stump
  2015-01-14  9:15                       ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Laine Stump @ 2015-01-14  0:59 UTC (permalink / raw)
  To: Gerhard Wiesinger, Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 01/13/2015 05:01 PM, Gerhard Wiesinger wrote:
> On 13.01.2015 22:16, Paolo Bonzini wrote:
>>
>> On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>>> I also had a look at the kernel code again:
>>> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
>>>
>>> 499         do {
>>> 500                 seq = read_seqcount_begin(&tk_core.seq);
>>> 501
>>> 502                 ts->tv_sec = tk->xtime_sec;
>>> 503                 nsecs = timekeeping_get_ns(&tk->tkr);
>>> 504
>>> 505         } while (read_seqcount_retry(&tk_core.seq, seq));
>>>
>>> So it looks like the seqcount always changes and therefore it loops
>>> forever here (as far as I dug down, this is the only loop here).
>>>
>>> Might be something wrong with the memory barriers in recent qemu-kvm
>>> releases?
>> No, that's not possible.  Unless you pause/resume or migrate the VM, all
>> of the handling of kvmclock is entirely in the kernel.
>
> Any other possible explanation of the problem?
>
> Had a look at the diff (I guess this is the right file, at least in the qemu tree):
> # no critical changes IMHO here
> git diff -u v1.6.2..v2.1.2 ./hw/i386/kvm/clock.c
>
> Trying to reproduce with a loop:
> #include <sys/time.h>
> #include <stdio.h>
>
> int main(int argc, char* argv[])
> {
>   struct timeval tv;
>   int i = 0;
>   for (;;)
>   {
>     gettimeofday(&tv, 0);
>     ++i;
>     if (i >= 10000000)
>     {
>       i = 0;
>       printf("%i\n", (int)tv.tv_sec);
>     }
>   }
>   return 0;
> }
>
> As I was writing "First tests seem to run well, so no quick win ....",
> I could reproduce it: a stall after 318s :-)
> (gdb) bt
> #0  0x00007fff6d9fefff in gettimeofday ()
> #1  0x00000000004005ad in main (argc=1, argv=0x7fff6d9b28b8) at
> gettimeofdayloop.c:10
>
> So we at least have a testcase which is quick to reproduce.
>
> So we are digging into my second finding of a major bug in
> qemu-kvm :-)
>
> Can someone try, too?
>
> Ciao,
> Gerhard
>
>

Take a look at the following kernel bug. It specifically deals with a
hang in gettimeofday() in a KVM guest:

https://bugzilla.redhat.com/show_bug.cgi?id=1178975

There is a link to a patched kernel you can try; it fixed my problems (I
was repeatedly getting hangs in python-urlgrabber during yum updates on
F21).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-14  0:59                     ` Laine Stump
@ 2015-01-14  9:15                       ` Gerhard Wiesinger
  2015-02-15  8:18                         ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-14  9:15 UTC (permalink / raw)
  To: Laine Stump, Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 14.01.2015 01:59, Laine Stump wrote:
>
> Take a look at the following kernel bug. It specifically deals with a
> hang in gettimeofday() in a KVM guest:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>
> There is a link to a patched kernel you can try; it fixed my problems (I
> was repeatedly getting hangs in python-urlgrabber during yum updates on
> F21).
>

Looks to be fixed, commented in:
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
Installed kernels.
http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247

Time to release a new official kernel :-)

Thanx for the comments.

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 20:13           ` Gerhard Wiesinger
  2015-01-13 20:48             ` Paolo Bonzini
@ 2015-01-14 17:47             ` Juan Quintela
  2015-01-14 17:52               ` Juan Quintela
  1 sibling, 1 reply; 35+ messages in thread
From: Juan Quintela @ 2015-01-14 17:47 UTC (permalink / raw)
  To: Gerhard Wiesinger; +Cc: Paolo Bonzini, qemu-devel

Gerhard Wiesinger <lists@wiesinger.com> wrote:
> On 12.01.2015 12:41, Gerhard Wiesinger wrote:
>> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>>
>>> I'll keep you up to date in the next days whether it happens again
>>> or not.
>>
>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>> didn't happen so far (although I had to reboot after kernel update).
>
> It happens also with qemu-kvm 2.2.0 on another VM where also
> PostgreSQL is running:
> (gdb) bt
> #0  0x00007fff9a1feff4 in gettimeofday ()
> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>
> What we know:
> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
> qemu-kvm-1.6.2-10.fc20.x86_64 on host
> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.1.2-7.fc21.x86_64 on host
> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>
> No one else can reproduce this or has similar problems?

I have similar problems, F21 guest on F21 host.
https://bugzilla.redhat.com/show_bug.cgi?id=1174664

For me, it is firefox that from time to time loops on vclock_gettime().
At that point the DSO has got corrupted (kvmclock MSRs), and as you
say, the only solution is a reboot.  Could you look at the bug, and see if
the areas pointed to by the MSR also got corrupted?  I tried using
kernel-debug with debugging for memory allocation (Marcelo's request).  I
see patterns like 0x5a5a5a5a, so clearly there is some corruption, but I
haven't been able to find _what_ is making that happen.
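
For anyone who wants to check this on their own guest: the kvmclock MSRs hold 
the guest-physical addresses of the shared clock records, and they can usually 
be read back from inside the guest through /dev/cpu/*/msr (msr module loaded, 
run as root). A rough sketch, assuming that interface is available; the low 
bit of the system time value is the enable flag:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* kvmclock MSR indices ("KVM" = 0x4b564d). */
#define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01

static int read_msr(int cpu, uint32_t msr, uint64_t *val)
{
  char path[64];
  int fd;

  snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
  fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  /* The MSR index is passed as the file offset, the value is 8 bytes. */
  if (pread(fd, val, sizeof(*val), msr) != sizeof(*val)) {
    close(fd);
    return -1;
  }
  close(fd);
  return 0;
}

int main(void)
{
  uint64_t wall, sys;

  if (read_msr(0, MSR_KVM_WALL_CLOCK_NEW, &wall) == 0 &&
      read_msr(0, MSR_KVM_SYSTEM_TIME_NEW, &sys) == 0)
    printf("wall clock GPA: 0x%llx, system time GPA: 0x%llx\n",
           (unsigned long long)wall, (unsigned long long)sys);
  else
    perror("rdmsr");
  return 0;
}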

> Any further ideas?

Nope :-(

> BTW: I'm running ntp in the following manner: internet <=> ntp server
> in VM <=> ntp client on KVM host (firewall runs in KVM)

I am using ntp on the host against the Internet, but the guest doesn't use ntp
(I do an ntpdate if I see that the guest has drifted too much, and it is
a rare event.  Right now I don't remember having fixed it lately).

Later, Juan.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-14 17:47             ` Juan Quintela
@ 2015-01-14 17:52               ` Juan Quintela
  2015-01-14 22:19                 ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Juan Quintela @ 2015-01-14 17:52 UTC (permalink / raw)
  To: Gerhard Wiesinger; +Cc: Paolo Bonzini, qemu-devel

Juan Quintela <quintela@redhat.com> wrote:
> Gerhard Wiesinger <lists@wiesinger.com> wrote:
>> On 12.01.2015 12:41, Gerhard Wiesinger wrote:
>>> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>>>
>>>> I'll keep you up to date in the next days whether it happens again
>>>> or not.
>>>
>>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>>> didn't happen so far (although I had to reboot after kernel update).
>>
>> It happens also with qemu-kvm 2.2.0 on another VM where also
>> PostgreSQL is running:
>> (gdb) bt
>> #0  0x00007fff9a1feff4 in gettimeofday ()
>> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one else can reproduce this or has similar problems?
>
> I have similar problems, F21 guest on F21 host.
> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
>
> For me, it is firefox that from time to time loops on vclock_gettime().
> At that point the DSO has got corrupted (kvmclock MSRs), and as you
> say, the only solution is a reboot.  Could you look at the bug, and see if
> the areas pointed to by the MSR also got corrupted?  I tried using
> kernel-debug with debugging for memory allocation (Marcelo's request).  I
> see patterns like 0x5a5a5a5a, so clearly there is some corruption, but I
> haven't been able to find _what_ is making that happen.
>
>> Any further ideas?
>
> Nope :-(
>
>> BTW: I'm running ntp in the following manner: internet <=> ntp server
>> in VM <=> ntp client on KVM host (firewall runs in KVM)
>
> I am using ntp on the host against the Internet, but the guest doesn't use ntp
> (I do an ntpdate if I see that the guest has drifted too much, and it is
> a rare event.  Right now I don't remember having fixed it lately).

I forgot to mention in the previous mail: I am using 2 vcpus.  With a single
vcpu I have been unable to trigger this bug.

Later, Juan.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-14 17:52               ` Juan Quintela
@ 2015-01-14 22:19                 ` Gerhard Wiesinger
  0 siblings, 0 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-14 22:19 UTC (permalink / raw)
  To: quintela; +Cc: Paolo Bonzini, qemu-devel

On 14.01.2015 18:52, Juan Quintela wrote:
> Juan Quintela <quintela@redhat.com> wrote:
> I forgot to mention in the previous mail: I am using 2 vcpus.  With a single
> vcpu I have been unable to trigger this bug.

There is already a fix with a new patched kernel available for the 
guest; see the bugzilla entry and my posts in this thread.

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-14  9:15                       ` Gerhard Wiesinger
@ 2015-02-15  8:18                         ` Gerhard Wiesinger
  2015-02-16 14:18                           ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-02-15  8:18 UTC (permalink / raw)
  To: Laine Stump, Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 14.01.2015 10:15, Gerhard Wiesinger wrote:
> On 14.01.2015 01:59, Laine Stump wrote:
>>
>> Take a look at the following kernel bug. It specifically deals with a
>> hang in gettimeofday() in a KVM guest:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>>
>> There is a link to a patched kernel you can try; it fixed my problems (I
>> was repeatedly getting hangs in python-urlgrabber during yum updates on
>> F21).
>>
>
> Looks to be fixed, commented in:
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
> Installed kernels.
> http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247
>

There seems to be another problem. My PostgreSQL server VM still crashes 
(it has happened 2 times).
1.) First crash: no reaction at all on the KVM console, no network access
2.) Second "crash":
- KVM console: 2 characters could be entered, then the console was dead forever
- ping works
- ssh does not
- host: 100% CPU on 2 cores
strace -y -p on kvm host (loops through this):
read(7<anon_inode:[eventfd]>, 0x7fffb862bf20, 16) = -1 EAGAIN (Resource 
temporarily unavailable)
ioctl(11<anon_inode:kvm-vm>, KVM_GET_DIRTY_LOG, 0x7fffb862bd30) = 0
ioctl(11<anon_inode:kvm-vm>, KVM_GET_DIRTY_LOG, 0x7fffb862bd30) = 0
write(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=66<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71, 
{0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614})
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8

Kernel (host/guest):  3.18.6-200.fc21.x86_64 #1 SMP
qemu-kvm-2.2.0-5.fc21.x86_64

Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
is fixed (it didn't occur with the test program posted at 
https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min; 
before, it happened reproducibly within 2min; still running)

So I guess there is another problem in the kernel with volatile and gcc 
optimizations (or maybe in qemu-KVM)

Any ideas?
Any other bug reports like this?

Thank you.

Ciao,
Gerhard

-- http://www.wiesinger.com/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-02-15  8:18                         ` Gerhard Wiesinger
@ 2015-02-16 14:18                           ` Paolo Bonzini
  2015-02-16 15:09                             ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-02-16 14:18 UTC (permalink / raw)
  To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt



On 15/02/2015 09:18, Gerhard Wiesinger wrote:
> 
> ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=66<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71,
> {0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614})
> write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
> read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8
> write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8

Can you grab some random backtraces ("thread apply all bt full") with gdb?

What is the libvirt XML or qemu command line?

Paolo

> Kernel (host/guest):  3.18.6-200.fc21.x86_64 #1 SMP
> qemu-kvm-2.2.0-5.fc21.x86_64
> 
> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
> is fixed (it didn't occur with the test program posted at
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min;
> before, it happened reproducibly within 2min; still running)
> 
> So I guess there is another problem in the kernel with volatile and gcc
> optimizations (or maybe in qemu-KVM)

No, this doesn't look like volatile.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-02-16 14:18                           ` Paolo Bonzini
@ 2015-02-16 15:09                             ` Gerhard Wiesinger
  2015-02-16 15:29                               ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-02-16 15:09 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 16.02.2015 15:18, Paolo Bonzini wrote:
>
> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>
>> Can you grab some random backtraces ("thread apply all bt full") with gdb?

Very low load on the machine, so I guess most threads will be sleeping and 
it is hard to catch the non-sleeping ones. See below.
For the record: gdb -p 14139 -ex 'thread apply all bt full' -batch
>>
>> What is the libvirt XML or qemu command line?

See below.
>> Kernel (host/guest):  3.18.6-200.fc21.x86_64 #1 SMP
>> qemu-kvm-2.2.0-5.fc21.x86_64
>>
>> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>> is fixed (it didn't occur with the test program posted at
>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min;
>> before, it happened reproducibly within 2min; still running)
>>
>> So I guess there is another problem in the kernel with volatile and gcc
>> optimizations (or maybe in qemu-KVM)
> No, this doesn't look like volatile.

But why 100% on 2 cores?

There were also recent bugfixes for optimizations in gcc-4.9.2-6.fc21; 
maybe we hit one of these.
- update from the 4.9 branch
   - PRs c++/54442, c++/64514, c++/64521, c++/64901, c/57653, c/61553, 
c/64766,
     c/64778, c/64824, c/64868, debug/64511, debug/64663, fortran/56867,
     fortran/57023, fortran/60922, fortran/62044, fortran/63733,
     fortran/64230, fortran/64528, fortran/64771, ipa/63970, ipa/64068,
     ipa/64559, libstdc++/64476, libstdc++/64584, libstdc++/64585,
     libstdc++/64646, libstdc++/64649, libstdc++/64680, middle-end/63704,
     middle-end/64391, middle-end/64421, middle-end/64734,
     rtl-optimization/61058, rtl-optimization/63637,
     rtl-optimization/64286, rtl-optimization/64557, target/61413,
     target/63424, target/64358, target/64479, target/64505, target/64513,
     target/64580, target/64795, target/64882, target/64938, target/64979,
     testsuite/64712, tree-optimization/64563

Ciao,
Gerhard

/usr/bin/qemu-system-x86_64 -machine accel=kvm -name myvm -S -machine 
pc-0.15,accel=kvm,usb=off -m 384 -realtime mlock=off -smp 
2,sockets=2,cores=1,threads=1 -uuid b40e77d3-cd86-4d59-9ee4-5756ec88bf99 
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/myvm.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc 
-no-shutdown -boot menu=on,strict=on -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
lsi,id=scsi0,bus=pci.0,addr=0x6 -device ahci,id=ahci0,bus=pci.0,addr=0x9 
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive 
if=none,id=drive-ide0-1-0,readonly=on,format=raw -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive 
file=/var/lib/libvirt/images/myvm.img,if=none,id=drive-virtio-disk0,format=qcow2 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
-netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=34 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:aa:bb:cc,bus=pci.0,addr=0x8 
-netdev tap,fd=35,id=hostnet1,vhost=on,vhostfd=36 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:aa:bb:cd,bus=pci.0,addr=0x7 
-chardev pty,id=charserial0 -device 
isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 
-vnc 127.0.0.1:9 -k de -device 
VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

<domain type='kvm' id='12'>
   <name>myvm</name>
   <uuid>b40e77d3-cd86-4d59-9ee4-5756ec88bf99</uuid>
   <description>Fedora 21</description>
   <memory unit='KiB'>393216</memory>
   <currentMemory unit='KiB'>393216</currentMemory>
   <vcpu placement='static'>2</vcpu>
   <resource>
     <partition>/machine</partition>
   </resource>
   <os>
     <type arch='x86_64' machine='pc-0.15'>hvm</type>
     <bootmenu enable='yes'/>
   </os>
   <features>
     <acpi/>
     <apic/>
     <pae/>
   </features>
   <clock offset='utc'/>
   <on_poweroff>destroy</on_poweroff>
   <on_reboot>restart</on_reboot>
   <on_crash>restart</on_crash>
   <devices>
     <emulator>/usr/bin/qemu-kvm</emulator>
     <disk type='block' device='cdrom'>
       <driver name='qemu' type='raw'/>
       <backingStore/>
       <target dev='hdc' bus='ide'/>
       <readonly/>
       <alias name='ide0-1-0'/>
       <address type='drive' controller='0' bus='1' target='0' unit='0'/>
     </disk>
     <disk type='file' device='disk'>
       <driver name='qemu' type='qcow2'/>
       <source file='/var/lib/libvirt/images/myvm.img'/>
       <backingStore/>
       <target dev='vda' bus='virtio'/>
       <boot order='1'/>
       <alias name='virtio-disk0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x03' 
function='0x0'/>
     </disk>
     <controller type='ide' index='0'>
       <alias name='ide0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' 
function='0x1'/>
     </controller>
     <controller type='usb' index='0'>
       <alias name='usb0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' 
function='0x2'/>
     </controller>
     <controller type='virtio-serial' index='0'>
       <alias name='virtio-serial0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' 
function='0x0'/>
     </controller>
     <controller type='scsi' index='0'>
       <alias name='scsi0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x06' 
function='0x0'/>
     </controller>
     <controller type='sata' index='0'>
       <alias name='sata0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x09' 
function='0x0'/>
     </controller>
     <controller type='pci' index='0' model='pci-root'>
       <alias name='pci.0'/>
     </controller>
     <interface type='network'>
       <mac address='52:54:00:aa:bb:cc'/>
       <source network='DMZ' bridge='virbr2'/>
       <target dev='vnet16'/>
       <model type='virtio'/>
       <alias name='net0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x08' 
function='0x0'/>
     </interface>
     <interface type='bridge'>
       <mac address='52:54:00:aa:bb:cd'/>
       <source bridge='br0'/>
       <target dev='vnet17'/>
       <model type='virtio'/>
       <alias name='net1'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x07' 
function='0x0'/>
     </interface>
     <serial type='pty'>
       <source path='/dev/pts/9'/>
       <target port='0'/>
       <alias name='serial0'/>
     </serial>
     <console type='pty' tty='/dev/pts/9'>
       <source path='/dev/pts/9'/>
       <target type='serial' port='0'/>
       <alias name='serial0'/>
     </console>
     <input type='tablet' bus='usb'>
       <alias name='input0'/>
     </input>
     <input type='mouse' bus='ps2'/>
     <input type='keyboard' bus='ps2'/>
     <graphics type='vnc' port='5909' autoport='yes' listen='127.0.0.1' 
keymap='de'>
       <listen type='address' address='127.0.0.1'/>
     </graphics>
     <video>
       <model type='vga' vram='16384' heads='1'/>
       <alias name='video0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' 
function='0x0'/>
     </video>
     <memballoon model='virtio'>
       <alias name='balloon0'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x05' 
function='0x0'/>
     </memballoon>
   </devices>
</domain>




Thread 11 (Thread 0x7ffe8bdfe700 (LWP 14147)):
#0  0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1  0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2  0x00007ffea8c37cec in kvm_cpu_exec ()
#3  0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 10 (Thread 0x7ffe8b5fd700 (LWP 14148)):
#0  0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1  0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2  0x00007ffea8c37cec in kvm_cpu_exec ()
#3  0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 9 (Thread 0x7ffe893ff700 (LWP 14158)):
#0  0x00007ffea771e590 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x00007ffea8eccd79 in qemu_cond_wait ()
#2  0x00007ffea8e50dd3 in vnc_worker_thread_loop ()
#3  0x00007ffea8e511b8 in vnc_worker_thread ()
#4  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 8 (Thread 0x7ffe4effd700 (LWP 18334)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 7 (Thread 0x7ffe8231f700 (LWP 18335)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 6 (Thread 0x7ffe908c7700 (LWP 22620)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 5 (Thread 0x7ffe2d7fa700 (LWP 12264)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 4 (Thread 0x7ffe83622700 (LWP 17892)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7ffe4f7fe700 (LWP 17893)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7ffe82b20700 (LWP 17894)):
---Type <return> to continue, or q <return> to quit---
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7ffea8aa5a80 (LWP 14139)):
#0  0x00007ffe9dca72c1 in ppoll () at /lib64/libc.so.6
#1  0x00007ffea8e6289c in qemu_poll_ns ()
#2  0x00007ffea8e62034 in main_loop_wait ()
#3  0x00007ffea8bfdcdd in main ()


[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007ffe9dca72c1 in ppoll () from /lib64/libc.so.6

Thread 8 (Thread 0x7ffe8bdfe700 (LWP 14147)):
#0  0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1  0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2  0x00007ffea8c37cec in kvm_cpu_exec ()
#3  0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 7 (Thread 0x7ffe8b5fd700 (LWP 14148)):
#0  0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1  0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2  0x00007ffea8c37cec in kvm_cpu_exec ()
#3  0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 6 (Thread 0x7ffe893ff700 (LWP 14158)):
#0  0x00007ffea771e590 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x00007ffea8eccd79 in qemu_cond_wait ()
#2  0x00007ffea8e50dd3 in vnc_worker_thread_loop ()
#3  0x00007ffea8e511b8 in vnc_worker_thread ()
#4  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 5 (Thread 0x7ffdeeffd700 (LWP 31356)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 4 (Thread 0x7ffdef7fe700 (LWP 32680)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7ffe0ffff700 (LWP 32681)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7ffe8131d700 (LWP 32682)):
#0  0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1  0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2  0x00007ffea8e5523c in worker_thread ()
#3  0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4  0x00007ffe9dcb279d in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7ffea8aa5a80 (LWP 14139)):
#0  0x00007ffe9dca72c1 in ppoll () at /lib64/libc.so.6
#1  0x00007ffea8e6289c in qemu_poll_ns ()
#2  0x00007ffea8e62034 in main_loop_wait ()
#3  0x00007ffea8bfdcdd in main ()

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-02-16 15:09                             ` Gerhard Wiesinger
@ 2015-02-16 15:29                               ` Paolo Bonzini
  2015-03-01 10:36                                 ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-02-16 15:29 UTC (permalink / raw)
  To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt



On 16/02/2015 16:09, Gerhard Wiesinger wrote:
> On 16.02.2015 15:18, Paolo Bonzini wrote:
>>
>> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>>
>>> Can you grab some random backtraces ("thread apply all bt full") with
>>> gdb?
> 
> Very low load on the machine, so I guess most threads will be sleeping and
> it is hard to catch the non-sleeping ones. See below.

This likely means that the 100% threads are not the I/O (event loop)
threads, but the VCPU threads.

>>> Kernel (host/guest):  3.18.6-200.fc21.x86_64 #1 SMP
>>> qemu-kvm-2.2.0-5.fc21.x86_64
>>>
>>> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>>> is fixed (it didn't occur with the test program posted at
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min;
>>> before, it happened reproducibly within 2min; still running)
>>>
>>> So I guess there is another problem in the kernel with volatile and gcc
>>> optimizations (or maybe in qemu-KVM)
>> No, this doesn't look like volatile.
> 
> But why 100% on 2 cores?

Because you have 2 VCPUs, I guess.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-02-16 15:29                               ` Paolo Bonzini
@ 2015-03-01 10:36                                 ` Gerhard Wiesinger
  2015-03-02  9:26                                   ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-01 10:36 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 16.02.2015 16:29, Paolo Bonzini wrote:
>
> On 16/02/2015 16:09, Gerhard Wiesinger wrote:
>> On 16.02.2015 15:18, Paolo Bonzini wrote:
>>> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>>> Can you grab some random backtraces ("thread apply all bt full") with
>>>> gdb?
>> Very low load on the machine, so I guess most will be sleeping and hard
>> to catch non sleeping ones. See below
> This likely means that the 100% threads are not the I/O (event loop)
> threads, but the VCPU threads.

ok, happened again, details below.

Any further ideas from the stack traces?

So far it has happened only on the PostgreSQL database VM. The kernel is alive 
(ping works well), but ssh is not working.
Console window: after entering one character at the login prompt, it crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or 
sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB, 
anon-rss: 162712kB, file-rss: 220kB

VM uptime is ~1 day, 2 cores at 100% CPU.
The VM is very stripped down, but it should nevertheless have enough memory (and 
also swap). It looks like it crashed at the nightly backup (pg_dumpall); 
the command is: ssh -x ${REMOTE_USER}@${REMOTE_HOST} "pg_dumpall" | bzip2 -9 
 > ${DEST_SQL}
free
              total        used        free      shared  buff/cache   available
Mem:         372264       28536      214780       20884      128948      303360
Swap:       1081340           0     1081340

Ciao,
Gerhard

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007feee62412c1 in ppoll () from /lib64/libc.so.6

Thread 4 (Thread 0x7feed3fff700 (LWP 4636)):
#0  0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1  0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2  0x00007feef11d1cec in kvm_cpu_exec ()
#3  0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4  0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5  0x00007feee624c79d in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7feed37fe700 (LWP 4637)):
#0  0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1  0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2  0x00007feef11d1cec in kvm_cpu_exec ()
#3  0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4  0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5  0x00007feee624c79d in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7feed1bff700 (LWP 4653)):
#0  0x00007feeefcb8590 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x00007feef1466d79 in qemu_cond_wait ()
#2  0x00007feef13eadd3 in vnc_worker_thread_loop ()
#3  0x00007feef13eb1b8 in vnc_worker_thread ()
#4  0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5  0x00007feee624c79d in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7feef103fa80 (LWP 4563)):
#0  0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
#1  0x00007feef13fc89c in qemu_poll_ns ()
#2  0x00007feef13fc034 in main_loop_wait ()
#3  0x00007feef1197cdd in main ()

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-01 10:36                                 ` Gerhard Wiesinger
@ 2015-03-02  9:26                                   ` Paolo Bonzini
  2015-03-02 15:52                                     ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-03-02  9:26 UTC (permalink / raw)
  To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt



On 01/03/2015 11:36, Gerhard Wiesinger wrote:
> So far it happened only the PostgreSQL database VM. Kernel is alive
> (ping works well). ssh is not working.
> console window: after entering one character at login prompt, then crashed:
> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
> sacrifice child
> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
> anon-rss: 162712kB, file-rss: 220kB

Can you get a vmcore or at least sysrq-t output?

Paolo

> VM uptime is ~1 day, 2 cores 100%CPU.
> VM is very stripped down, nevertheless it should have enough memory (and
> also swap). Looks like it crashed at the nightly backup (pg_dumpall),
> command is: ssh -x ${REMOTE_USER}@${REMOTE_HOST} "pg_dumpall" | bzip2 -9
>> ${DEST_SQL}
> free
>               total        used        free      shared  buff/cache   available
> Mem:         372264       28536      214780       20884      128948      303360
> Swap:       1081340           0     1081340
> 
> Ciao,
> Gerhard
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007feee62412c1 in ppoll () from /lib64/libc.so.6
> 
> Thread 4 (Thread 0x7feed3fff700 (LWP 4636)):
> #0  0x00007feee6242977 in ioctl () at /lib64/libc.so.6
> #1  0x00007feef11d1c35 in kvm_vcpu_ioctl ()
> #2  0x00007feef11d1cec in kvm_cpu_exec ()
> #3  0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
> #4  0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5  0x00007feee624c79d in clone () at /lib64/libc.so.6
> 
> Thread 3 (Thread 0x7feed37fe700 (LWP 4637)):
> #0  0x00007feee6242977 in ioctl () at /lib64/libc.so.6
> #1  0x00007feef11d1c35 in kvm_vcpu_ioctl ()
> #2  0x00007feef11d1cec in kvm_cpu_exec ()
> #3  0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
> #4  0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5  0x00007feee624c79d in clone () at /lib64/libc.so.6
> 
> Thread 2 (Thread 0x7feed1bff700 (LWP 4653)):
> #0  0x00007feeefcb8590 in pthread_cond_wait@@GLIBC_2.3.2 () at
> /lib64/libpthread.so.0
> #1  0x00007feef1466d79 in qemu_cond_wait ()
> #2  0x00007feef13eadd3 in vnc_worker_thread_loop ()
> #3  0x00007feef13eb1b8 in vnc_worker_thread ()
> #4  0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5  0x00007feee624c79d in clone () at /lib64/libc.so.6
> 
> Thread 1 (Thread 0x7feef103fa80 (LWP 4563)):
> #0  0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
> #1  0x00007feef13fc89c in qemu_poll_ns ()
> #2  0x00007feef13fc034 in main_loop_wait ()
> #3  0x00007feef1197cdd in main ()
> 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-02  9:26                                   ` Paolo Bonzini
@ 2015-03-02 15:52                                     ` Gerhard Wiesinger
  2015-03-02 17:15                                       ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-02 15:52 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 02.03.2015 10:26, Paolo Bonzini wrote:
>
> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>> So far it happened only the PostgreSQL database VM. Kernel is alive
>> (ping works well). ssh is not working.
>> console window: after entering one character at login prompt, then crashed:
>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>> sacrifice child
>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>> anon-rss: 162712kB, file-rss: 220kB
> Can you get a vmcore or at least sysrq-t output?

Yes, next time it happens I can analyze it.

I think there are 2 problems:
1.) An OOM (Out of Memory) problem with the low memory settings and kernel 
settings (see below)
2.) An instability problem which might be related to 1.)

What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated the machine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
     <type arch='x86_64' machine='pc-0.15'>hvm</type>

virsh edit database
virsh dumpxml database | grep "<type"
     <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>

The SMBIOS version is therefore updated from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[    0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off 
highres=off
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory ballooning by 
blacklisting the balloon driver or removing it from the virsh XML config 
(a minimal sketch follows below)
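
A minimal sketch for g) (my assumption; the element follows the usual libvirt 
domain XML schema, adjust to the actual config):

virsh edit database
# replace the existing <memballoon .../> element with:
    <memballoon model='none'/>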

Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem, 
no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, 
no crash

So the OOM problem seems to be solved (at least it hasn't happened so far) 
by installing 512MB RAM and setting vm.overcommit_memory = 1 (I guess 
just setting overcommit would be enough, too).

The instability hasn't occurred so far. If I can't reproduce it, I'll revert 
the settings.

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-02 15:52                                     ` Gerhard Wiesinger
@ 2015-03-02 17:15                                       ` Gerhard Wiesinger
  2015-03-03  9:12                                         ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-02 17:15 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 02.03.2015 16:52, Gerhard Wiesinger wrote:
> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>
>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>> (ping works well). ssh is not working.
>>> console window: after entering one character at login prompt, then 
>>> crashed:
>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>>> sacrifice child
>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>> anon-rss: 162712kB, file-rss: 220kB
>> Can you get a vmcore or at least sysrq-t output?
>
> Yes, next time it happens I can analyze it.
>
> I think there are 2 problems:
> 1.) OOM (Out of Memory) problem with the low memory settings and 
> kernel settings (see below)
> 2.) Instability problem which might have a dependency to 1.)
>
> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>> a.) Updated the machine type from pc-0.15 to pc-i440fx-2.2
> virsh dumpxml database | grep "<type"
>     <type arch='x86_64' machine='pc-0.15'>hvm</type>
>
> virsh edit database
> virsh dumpxml database | grep "<type"
>     <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>
> SMBIOS is updated therefore from 2.4 to 2.8:
> dmesg|grep -i SMBIOS
> [    0.000000] SMBIOS 2.8 present.
> b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off 
> highres=off
> c.) Changed overcommit to 1
> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
> d.) Tried 1 VCPU instead of 2
> e.) Installed 512MB vRAM instead of 384MB
> f.) Prepared for sysrq and vmcore
> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
> sysctl -w kernel.sysrq=1
> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
> virsh dump domain-name /tmp/dumpfile
>> g.) Further ideas, not yet done: disable memory ballooning by 
>> blacklisting the balloon driver or removing it from the virsh XML config
>
> Summary:
> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem, 
> no crash
> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, 
> no crash

3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, 
no crash

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-02 17:15                                       ` Gerhard Wiesinger
@ 2015-03-03  9:12                                         ` Gerhard Wiesinger
  2015-03-03 10:12                                           ` Paolo Bonzini
                                                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03  9:12 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 02.03.2015 18:15, Gerhard Wiesinger wrote:
> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>
>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>> (ping works well). ssh is not working.
>>>> console window: after entering one character at login prompt, then 
>>>> crashed:
>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>>>> sacrifice child
>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>> anon-rss: 162712kB, file-rss: 220kB
>>> Can you get a vmcore or at least sysrq-t output?
>>
>> Yes, next time it happens I can analyze it.
>>
>> I think there are 2 problems:
>> 1.) OOM (Out of Memory) problem with the low memory settings and 
>> kernel settings (see below)
>> 2.) Instability problem which might have a dependency to 1.)
>>
>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>> a.) Updated the machine type from pc-0.15 to pc-i440fx-2.2
>> virsh dumpxml database | grep "<type"
>>     <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>
>> virsh edit database
>> virsh dumpxml database | grep "<type"
>>     <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>
>> SMBIOS is updated therefore from 2.4 to 2.8:
>> dmesg|grep -i SMBIOS
>> [    0.000000] SMBIOS 2.8 present.
>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc 
>> nohz=off highres=off
>> c.) Changed overcommit to 1
>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>> d.) Tried 1 VCPU instead of 2
>> e.) Installed 512MB vRAM instead of 384MB
>> f.) Prepared for sysrq and vmcore
>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>> sysctl -w kernel.sysrq=1
>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>> virsh dump domain-name /tmp/dumpfile
>>> g.) Further ideas, not yet done: disable memory ballooning by 
>>> blacklisting the balloon driver or removing it from the virsh XML config
>>
>> Summary:
>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM 
>> problem, no crash
>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>> problem, no crash
>
> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, 
> no crash

3b.) It still happened again at the nightly backup with the same 
configuration as in 3.) (384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, 
pc-i440fx-2.2): no OOM problem, ping ok, no reaction, BUT CRASHED again

SYSRQ: no reaction from the VM
virsh send-key vm KEY_LEFTALT KEY_SYSRQ KEY_T

virsh dump vm file.core
error: Failed to core dump domain vm to file.core
error: internal error: unable to execute QEMU command 'migrate': State 
blocked by non-migratable device '0000:00:09.0/ich9_ahci'

Removed the SATA controller; the dump should work in the future.

Any further ideas?

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-03  9:12                                         ` Gerhard Wiesinger
@ 2015-03-03 10:12                                           ` Paolo Bonzini
  2015-03-03 11:51                                           ` Gonglei
  2015-03-03 12:28                                           ` Gerhard Wiesinger
  2 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-03-03 10:12 UTC (permalink / raw)
  To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt



On 03/03/2015 10:12, Gerhard Wiesinger wrote:
> 
> virsh dump vm file.core
> error: Failed to core dump domain vm to file.core
> error: internal error: unable to execute QEMU command 'migrate': State
> blocked by non-migratable device '0000:00:09.0/ich9_ahci'
> 
> Removed the SATA controller, dump should work for the future.

Thanks.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-03  9:12                                         ` Gerhard Wiesinger
  2015-03-03 10:12                                           ` Paolo Bonzini
@ 2015-03-03 11:51                                           ` Gonglei
  2015-03-03 12:28                                           ` Gerhard Wiesinger
  2 siblings, 0 replies; 35+ messages in thread
From: Gonglei @ 2015-03-03 11:51 UTC (permalink / raw)
  To: Gerhard Wiesinger, Paolo Bonzini, Laine Stump, qemu-devel,
	Cole Robinson, virt

On 2015/3/3 17:12, Gerhard Wiesinger wrote:
> 3b.) Still happened again at the nightly backup with same configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no reaction, BUT CRASHED again
> 
> SYSRQ: no reaction of the VM
> virsh send-key vm KEY_LEFTALT KEY_SYSRQ KEY_T
> 
> virsh dump vm file.core
> error: Failed to core dump domain vm to file.core
> error: internal error: unable to execute QEMU command 'migrate': State blocked by non-migratable device '0000:00:09.0/ich9_ahci'
> 
You can use the '--memory-only' argument to dump only the VM's memory, e.g.:

virsh dump vm file.core --memory-only
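
A possible next step with such a dump (my assumption, not discussed in this 
thread): analyze it with the crash utility against the guest's debuginfo 
vmlinux, assuming a crash version that understands the ELF dump format, e.g.:

yum --enablerepo='*-debuginfo' install crash kernel-debuginfo
crash /usr/lib/debug/lib/modules/<guest-kernel-version>/vmlinux file.core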

> Removed the SATA controller, dump should work for the future.
> 
> Any further ideas?

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-03  9:12                                         ` Gerhard Wiesinger
  2015-03-03 10:12                                           ` Paolo Bonzini
  2015-03-03 11:51                                           ` Gonglei
@ 2015-03-03 12:28                                           ` Gerhard Wiesinger
  2015-03-03 13:18                                             ` Gerhard Wiesinger
  2 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 12:28 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 03.03.2015 10:12, Gerhard Wiesinger wrote:
> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>
>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>>> (ping works well). ssh is not working.
>>>>> console window: after entering one character at login prompt, then 
>>>>> crashed:
>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 
>>>>> 112 or
>>>>> sacrifice child
>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>> Can you get a vmcore or at least sysrq-t output?
>>>
>>> Yes, next time it happens I can analyze it.
>>>
>>> I think there are 2 problems:
>>> 1.) OOM (Out of Memory) problem with the low memory settings and 
>>> kernel settings (see below)
>>> 2.) Instability problem which might have a dependency to 1.)
>>>
>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>> a.) Updated the machine type from pc-0.15 to pc-i440fx-2.2
>>> virsh dumpxml database | grep "<type"
>>>     <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>
>>> virsh edit database
>>> virsh dumpxml database | grep "<type"
>>>     <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>
>>> SMBIOS is updated therefore from 2.4 to 2.8:
>>> dmesg|grep -i SMBIOS
>>> [    0.000000] SMBIOS 2.8 present.
>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc 
>>> nohz=off highres=off
>>> c.) Changed overcommit to 1
>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>> d.) Tried 1 VCPU instead of 2
>>> e.) Installed 512MB vRAM instead of 384MB
>>> f.) Prepared for sysrq and vmcore
>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>> sysctl -w kernel.sysrq=1
>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>> virsh dump domain-name /tmp/dumpfile
>>> g.) Further ideas, not yet done: disable memory ballooning by 
>>> blacklisting the balloon driver or removing it from the virsh XML config
>>>
>>> Summary:
>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM 
>>> problem, no crash
>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>>> problem, no crash
>>
>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>> problem, no crash
>
> 3b.) Still happened again at the nightly backup with same 
> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU, 
> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no 
> reaction, BUT CRASHED again
>

3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, 
pc-i440fx-2.2: OOM problem, no crash

postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap  = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, 
file-rss:0kB

An OOM should not occur here, according to
https://www.kernel.org/doc/gorman/html/understand/understand016.html:
"Is there enough swap space left (nr_swap_pages > 0)? If yes, not OOM."

So why does an OOM condition occur anyway? Looks like a bug in the kernel.
Any ideas?
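
Some standard diagnostics for the next occurrence (a sketch; these are the 
usual procfs/dmesg sources, not something from the documentation above):

# full OOM report incl. per-zone watermarks and the process table
dmesg | grep -B5 -A40 'invoked oom-killer'
# memory/swap accounting right after it happens
grep -E 'MemFree|SwapFree|Committed_AS|CommitLimit' /proc/meminfo
# per-zone free pages vs. min/low/high watermarks
grep -E 'zone|free |min |low |high ' /proc/zoneinfo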

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-03 12:28                                           ` Gerhard Wiesinger
@ 2015-03-03 13:18                                             ` Gerhard Wiesinger
  2015-03-03 20:50                                               ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 13:18 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 03.03.2015 13:28, Gerhard Wiesinger wrote:
> On 03.03.2015 10:12, Gerhard Wiesinger wrote:
>> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>>
>>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>>>> (ping works well). ssh is not working.
>>>>>> console window: after entering one character at login prompt, 
>>>>>> then crashed:
>>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 
>>>>>> 112 or
>>>>>> sacrifice child
>>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>>> Can you get a vmcore or at least sysrq-t output?
>>>>
>>>> Yes, next time it happens I can analyze it.
>>>>
>>>> I think there are 2 problems:
>>>> 1.) OOM (Out of Memory) problem with the low memory settings and 
>>>> kernel settings (see below)
>>>> 2.) Instability problem which might have a dependency to 1.)
>>>>
>>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>>> a.) Updated the machine type from pc-0.15 to pc-i440fx-2.2
>>>> virsh dumpxml database | grep "<type"
>>>>     <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>>
>>>> virsh edit database
>>>> virsh dumpxml database | grep "<type"
>>>>     <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>>
>>>> SMBIOS is updated therefore from 2.4 to 2.8:
>>>> dmesg|grep -i SMBIOS
>>>> [    0.000000] SMBIOS 2.8 present.
>>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc 
>>>> nohz=off highres=off
>>>> c.) Changed overcommit to 1
>>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>>> d.) Tried 1 VCPU instead of 2
>>>> e.) Installed 512MB vRAM instead of 384MB
>>>> f.) Prepared for sysrq and vmcore
>>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>>> sysctl -w kernel.sysrq=1
>>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>>> virsh dump domain-name /tmp/dumpfile
>>>> g.) Further ideas, not yet done: disable memory ballooning by 
>>>> blacklisting the balloon driver or removing it from the virsh XML config
>>>>
>>>> Summary:
>>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM 
>>>> problem, no crash
>>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>>>> problem, no crash
>>>
>>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>>> problem, no crash
>>
>> 3b.) Still happened again at the nightly backup with same 
>> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU, 
>> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no 
>> reaction, BUT CRASHED again
>>
>
> 3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, 
> pc-i440fx-2.2: OOM problem, no crash
>
> postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> Free swap  = 905924kB
> Total swap = 1081340kB
> Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
> Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, 
> file-rss:0kB
>
> An OOM should not occur:
> https://www.kernel.org/doc/gorman/html/understand/understand016.html
> Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
>
> Why does an OOM condition occur? Looks like a bug in the kernel?
> Any ideas?

# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed

Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, 
file-rss:0kB

free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921
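
For reference, the arithmetic behind this: 805306368 bytes is 768MB (the 
"800MB" above is rounded), while the guest has about 363MB of RAM plus 1055MB 
of swap, roughly 1.4GB in total, so the allocation should in principle be 
satisfiable from RAM plus swap.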

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1392
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1392
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


  # Machine becomes unresponsive and stalls for seconds, but never uses 
more than the available 1055MB of swap (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 136472 241196   1400  98544    4   57  1724    67  211  261  2  3 91  2  2
 0  0 136472 241228   1400  98540    0    0     0     0   30   48  0  0 100  0  0
 0  0 136472 241228   1408  98532    0    0     0    52   53   51  0  0 89 11  0
 0  0 136472 241224   1408  98540    0    0     0   112   44   92  0  0 100  0  0
 0  0 136472 241224   1408  98540    0    0     0     0   24   32  0  0 100  0  0
 0  0 136472 241352   1408  98540    0    0     0     0   31   44  0  1 100  0  0
 0  0 136472 241328   1408  98540    0    0     0    36   97  142  0  1 99  0  0
 0  0 136472 241364   1408  98540    0    0     0     0   22   30  0  0 100  0  0
 0  0 136472 241376   1416  98532    0    0     0    80   52   45  0  0 92  8  1
 1  0 136472   9236   1416  98548    0    0     8     0  762   55 11 23 66  0  0
 2  7 270496   3804    140  61172 1144 412268 15028 412340 92805 301836  1 49  1 27 22
 1 12 620320   4788    140  35240 1240 114864 96860 114976 46242 96395  1 26  0 61 12
 3 18 661436   4788    144  35568  508    0 167884     0 5605 8097  5 76  0 16  4
 3  4 661220   4288    144  34256  252    0 273684     0 7454 9777  3 71  0 19  7
 5 20 661024   4532    144  34772  320    0 238288     0 9452 12395  3 78  0 13  6
 6 19 660596   4592    144  35884  320    0 233160     8 12401 16798  5 67  0 12 15
 3 20 677268   4296    140  36816 2180 18200 444328 18332 19382 36234  8 67  0 11 14
 3 25 677208   4792    136  36044   68    0 524340    12 20637 26558  3 74  0 15  8
 2 21 687880   4964    136  38200  260 10784 311152 10884 17707 28941  4 78  0 12  5
 3 21 693808   4380    176  36860  136 6024 388932  6096 14576 22372  3 84  0  6  7
 3 27 693740   4432    152  38288   56 20736 419592 20744 23212 31219  4 87  0  7  2
 3 23 713696   4384    152  38172  796    0 481420    96 16498 27177  8 87  0  4  1
 3 27 713360   4116    152  38372 1844    0 1308552   296 25074 33901  5 85  0  9  1
 3 29 714628   4416    180  41992  256 2556 501832  2704 56498 76293  3 91  0  5  1
 3 29 714572   3860    172  41076  156    0 920736   152 12131 17339  5 94  0  0  0
 4 28 714396   5108    152  40124  212 10924 567648 11148 41901 56712  4 90  0  4  2
 3 30 725216   4060    136  40604  124    0 286384   156 21992 35505  5 91  0  2  3
 8 12 148836 230388    320  70888 5356    0 24304    52 9977 15084 17 75  0  5  3
 0  0 146692 271900    416  76680 2200    0  6592     0 1561 3198 10 10 78  2  1
 0  0 146584 271900    416  76892  152    0   184     0   75  139  0  0 100  0  1
 0  0 146488 271396    552  76980  128    0   264    36  124  230  0  1 98  1  0
 0  0 146372 271076    680  77196  124    0   252     8   79  167  0  0 100  0  0
 0  0 146312 270948    688  77332   64    0    64    80   61  102  0  0 97  3  1

What's wrong here?
Kernel Bug?

Ciao,
Gerhard

/* mallocsleep.c: allocate the requested number of bytes (default 16MB),
   touch every word so the pages are really backed, then sleep for an hour. */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

typedef unsigned int BOOL;
typedef char* PCHAR;
typedef unsigned int DWORD;

#define FALSE 0
#define TRUE 1

/* Parse a non-negative number from s into *retvalue; returns FALSE on error. */
BOOL getlong(PCHAR s,DWORD* retvalue)
{
   char *eptr;
   long value;

   value=strtol(s,&eptr,0);
   if ((eptr==s)||(*eptr!='\0')) return FALSE;
   if (value<0) return FALSE;
   *retvalue=value;
   return TRUE;
}

int main(int argc,char* argv[])
{
   unsigned int* p;
   unsigned int size=16*1024*1024;
   unsigned int size_of=sizeof(unsigned int);
   unsigned int i;

   if (argc>1)
   {
     if (!getlong(argv[1],&size))
     {
       printf("Wrong memsize!\n");
       exit(1);
     }
   }

   p=malloc(size);
   if (p==NULL)
   {
     printf("malloc failed!\n");
     exit(1);
   }

   /* Touch every word so the kernel has to back the allocation with real
      pages (with overcommit, malloc alone does not commit the memory). */
   for(i=0;i<(size/size_of);i++) p[i]=0;

   sleep(3600);

   free(p);

   return 0;
}
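
For reference, the test program above can be built with a plain gcc invocation 
(the exact command is not given in the thread), e.g.:

gcc -o mallocsleep mallocsleep.c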

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-03-03 13:18                                             ` Gerhard Wiesinger
@ 2015-03-03 20:50                                               ` Gerhard Wiesinger
  0 siblings, 0 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 20:50 UTC (permalink / raw)
  To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 03.03.2015 14:18, Gerhard Wiesinger wrote:
> On 03.03.2015 13:28, Gerhard Wiesinger wrote:
>> On 03.03.2015 10:12, Gerhard Wiesinger wrote:
>>> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>>>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>>>
>>>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>>>>> (ping works well). ssh is not working.
>>>>>>> console window: after entering one character at login prompt, 
>>>>>>> then crashed:
>>>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 
>>>>>>> 112 or
>>>>>>> sacrifice child
>>>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>>>> Can you get a vmcore or at least sysrq-t output?
>>>>>
>>>>> Yes, next time it happens I can analyze it.
>>>>>
>>>>> I think there are 2 problems:
>>>>> 1.) OOM (Out of Memory) problem with the low memory settings and 
>>>>> kernel settings (see below)
>>>>> 2.) Instability problem which might have a dependency to 1.)
>>>>>
>>>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>>>> a.) Updated the machine type from pc-0.15 to pc-i440fx-2.2
>>>>> virsh dumpxml database | grep "<type"
>>>>>     <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>>>
>>>>> virsh edit database
>>>>> virsh dumpxml database | grep "<type"
>>>>>     <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>>>
>>>>> SMBIOS is updated therefore from 2.4 to 2.8:
>>>>> dmesg|grep -i SMBIOS
>>>>> [    0.000000] SMBIOS 2.8 present.
>>>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc 
>>>>> nohz=off highres=off
>>>>> c.) Changed overcommit to 1
>>>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>>>> d.) Tried 1 VCPU instead of 2
>>>>> e.) Installed 512MB vRAM instead of 384MB
>>>>> f.) Prepared for sysrq and vmcore
>>>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>>>> sysctl -w kernel.sysrq=1
>>>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>>>> virsh dump domain-name /tmp/dumpfile
>>>>> g.) Further ideas, not yet done: disable memory ballooning by 
>>>>> blacklisting the balloon driver or removing it from the virsh XML config
>>>>>
>>>>> Summary:
>>>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM 
>>>>> problem, no crash
>>>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>>>>> problem, no crash
>>>>
>>>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM 
>>>> problem, no crash
>>>
>>> 3b.) Still happened again at the nightly backup with same 
>>> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU, 
>>> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no 
>>> reaction, BUT CRASHED again
>>>
>>
>> 3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, 
>> pc-i440fx-2.2: OOM problem, no crash
>>
>> postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>> Free swap  = 905924kB
>> Total swap = 1081340kB
>> Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
>> Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, 
>> file-rss:0kB
>>
>> An OOM should not occur:
>> https://www.kernel.org/doc/gorman/html/understand/understand016.html
>> Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
>>
>> Why does an OOM condition occur? Looks like a bug in the kernel?
>> Any ideas?
>
> # Allocating 800MB, killed by OOM killer
> ./mallocsleep 805306368
> Killed
>
> Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice 
> child
> Killed process 27160 (mallocsleep) total-vm:790588kB, 
> anon-rss:214948kB, file-rss:0kB
>
> free -m
>               total        used        free      shared buff/cache   
> available
> Mem:            363          23         252          23 87 295
> Swap:          1055         134         921
>
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1392
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 1392
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
>
>  # Machine is getting unresponsive and stalls for seconds, but never 
> reaches more than 1055MB swap size (+ 384MB RAM)
> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- 
> ------cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo in   cs us sy 
> id wa st
>  0  0 136472 241196   1400  98544    4   57  1724    67 211  261 2  3 
> 91  2  2
>  0  0 136472 241228   1400  98540    0    0     0     0 30   48 0  0 
> 100  0  0
>  0  0 136472 241228   1408  98532    0    0     0    52 53   51 0  0 
> 89 11  0
>  0  0 136472 241224   1408  98540    0    0     0   112 44   92 0  0 
> 100  0  0
>  0  0 136472 241224   1408  98540    0    0     0     0 24   32 0  0 
> 100  0  0
>  0  0 136472 241352   1408  98540    0    0     0     0 31   44 0  1 
> 100  0  0
>  0  0 136472 241328   1408  98540    0    0     0    36 97  142 0  1 
> 99  0  0
>  0  0 136472 241364   1408  98540    0    0     0     0 22   30 0  0 
> 100  0  0
>  0  0 136472 241376   1416  98532    0    0     0    80 52   45 0  0 
> 92  8  1
>  1  0 136472   9236   1416  98548    0    0     8     0 762   55 11 23 
> 66  0  0
>  2  7 270496   3804    140  61172 1144 412268 15028 412340 92805 
> 301836  1 49  1 27 22
>  1 12 620320   4788    140  35240 1240 114864 96860 114976 46242 
> 96395  1 26  0 61 12
>  3 18 661436   4788    144  35568  508    0 167884     0 5605 8097  5 
> 76  0 16  4
>  3  4 661220   4288    144  34256  252    0 273684     0 7454 9777  3 
> 71  0 19  7
>  5 20 661024   4532    144  34772  320    0 238288     0 9452 12395  3 
> 78  0 13  6
>  6 19 660596   4592    144  35884  320    0 233160     8 12401 16798  
> 5 67  0 12 15
>  3 20 677268   4296    140  36816 2180 18200 444328 18332 19382 36234  
> 8 67  0 11 14
>  3 25 677208   4792    136  36044   68    0 524340    12 20637 26558  
> 3 74  0 15  8
>  2 21 687880   4964    136  38200  260 10784 311152 10884 17707 28941  
> 4 78  0 12  5
>  3 21 693808   4380    176  36860  136 6024 388932  6096 14576 22372  
> 3 84  0  6  7
>  3 27 693740   4432    152  38288   56 20736 419592 20744 23212 31219  
> 4 87  0  7  2
>  3 23 713696   4384    152  38172  796    0 481420    96 16498 27177  
> 8 87  0  4  1
>  3 27 713360   4116    152  38372 1844    0 1308552   296 25074 33901  
> 5 85  0  9  1
>  3 29 714628   4416    180  41992  256 2556 501832  2704 56498 76293  
> 3 91  0  5  1
>  3 29 714572   3860    172  41076  156    0 920736   152 12131 17339  
> 5 94  0  0  0
>  4 28 714396   5108    152  40124  212 10924 567648 11148 41901 56712  
> 4 90  0  4  2
>  3 30 725216   4060    136  40604  124    0 286384   156 21992 35505  
> 5 91  0  2  3
>  8 12 148836 230388    320  70888 5356    0 24304    52 9977 15084 17 
> 75  0  5  3
>  0  0 146692 271900    416  76680 2200    0  6592     0 1561 3198 10 
> 10 78  2  1
>  0  0 146584 271900    416  76892  152    0   184     0 75  139 0  0 
> 100  0  1
>  0  0 146488 271396    552  76980  128    0   264    36 124  230 0  1 
> 98  1  0
>  0  0 146372 271076    680  77196  124    0   252     8 79  167 0  0 
> 100  0  0
>  0  0 146312 270948    688  77332   64    0    64    80 61  102 0  0 
> 97  3  1
>
> What's wrong here?
> Kernel Bug?
>

This all reminds me of the post here:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html
Last month, these outages began to happen more regularly but also very 
randomly. The symptoms were quite similar:
     CPU spiked to 100% utilization.
     Disk I/O spiked.
     Server became completely inaccessible via SSH, etc.
     Logs show the Linux Out Of Memory (OOM) killer killing user 
processes that have hit their cgroup's memory limit shortly before the 
server froze.
     Host memory was not under pressure - it was close to fully utilized 
(which is normal) but there was a lot of unused swap.
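
If the same cgroup angle applies here, a quick check inside the guest would be 
something like this (a sketch assuming cgroup v1 with the memory controller 
mounted at the usual path):

# which memory cgroup the pg_dump process runs in
cat /proc/$(pgrep -o pg_dump)/cgroup
# the memory limit of that cgroup (substitute the path printed above)
cat /sys/fs/cgroup/memory/<cgroup-path>/memory.limit_in_bytes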

Ciao,
Gerhard

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2015-03-03 20:51 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-08 13:36 [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever Gerhard Wiesinger
2015-01-08 17:24 ` Paolo Bonzini
2015-01-08 18:12   ` Gerhard Wiesinger
2015-01-08 18:22     ` Paolo Bonzini
2015-01-08 22:28       ` Gerhard Wiesinger
2015-01-08 22:42         ` Paolo Bonzini
2015-01-12 11:41         ` Gerhard Wiesinger
2015-01-12 11:46           ` Paolo Bonzini
2015-01-12 11:53             ` Gerhard Wiesinger
2015-01-12 11:54               ` Paolo Bonzini
2015-01-13 20:13           ` Gerhard Wiesinger
2015-01-13 20:48             ` Paolo Bonzini
2015-01-13 21:08               ` Cole Robinson
2015-01-13 21:14               ` Gerhard Wiesinger
2015-01-13 21:16                 ` Paolo Bonzini
2015-01-13 22:01                   ` Gerhard Wiesinger
2015-01-14  0:59                     ` Laine Stump
2015-01-14  9:15                       ` Gerhard Wiesinger
2015-02-15  8:18                         ` Gerhard Wiesinger
2015-02-16 14:18                           ` Paolo Bonzini
2015-02-16 15:09                             ` Gerhard Wiesinger
2015-02-16 15:29                               ` Paolo Bonzini
2015-03-01 10:36                                 ` Gerhard Wiesinger
2015-03-02  9:26                                   ` Paolo Bonzini
2015-03-02 15:52                                     ` Gerhard Wiesinger
2015-03-02 17:15                                       ` Gerhard Wiesinger
2015-03-03  9:12                                         ` Gerhard Wiesinger
2015-03-03 10:12                                           ` Paolo Bonzini
2015-03-03 11:51                                           ` Gonglei
2015-03-03 12:28                                           ` Gerhard Wiesinger
2015-03-03 13:18                                             ` Gerhard Wiesinger
2015-03-03 20:50                                               ` Gerhard Wiesinger
2015-01-14 17:47             ` Juan Quintela
2015-01-14 17:52               ` Juan Quintela
2015-01-14 22:19                 ` Gerhard Wiesinger
