* [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
@ 2015-01-08 13:36 Gerhard Wiesinger
2015-01-08 17:24 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 13:36 UTC (permalink / raw)
To: qemu-devel
Hello,
After upgrading my KVM environment from Fedora 20 to Fedora 21 up2date
(hosts and guests, Intel CPU) I've the following problem:
1.) On the database VM PostgresSQL e.g. 2 processes hang with 100% cpu
2.) On the monitoring VM Munin/RRDtool also hangs with 100% cpu
Killing of processes is not possible, only reboot helps.
But I nailed it down to the following:
yum install strace
strace -y -p339
Process 339 attached
# no system calls here
^CProcess 339 detached
Going deeper:
yum --enablerepo fedora-debuginfo,updates-debuginfo install gdb
postgresql-debuginfo
gdb postgres 339
(gdb) bt
#0 0x00007fffffbf8ff8 in gettimeofday ()
#1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
(gdb) frame 1
#1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
1274 gettimeofday(&tp, NULL);
(same on second process here)
Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
CPU and never ends!
Therefore I guess this is a problem either in the Linux kernel or in
QEMU/KVM.
It might be the case that something changed in timer handling or some
default changed here. VMs are time syncrhonized with NTP so this might
also be a problem that the syscall hangs here.
Workaround when it happens: reboot ....
Any further ideas?
Thank you.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-08 13:36 [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever Gerhard Wiesinger
@ 2015-01-08 17:24 ` Paolo Bonzini
2015-01-08 18:12 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 17:24 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel
On 08/01/2015 14:36, Gerhard Wiesinger wrote:
> Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
> CPU and never ends!
>
> Therefore I guess this is a problem either in the Linux kernel or in
> QEMU/KVM.
What kernel are you running on (and were you running on)?
Can you try F20 host and F21 guest or vice versa?
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-08 17:24 ` Paolo Bonzini
@ 2015-01-08 18:12 ` Gerhard Wiesinger
2015-01-08 18:22 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 18:12 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel
On 08.01.2015 18:24, Paolo Bonzini wrote:
>
> On 08/01/2015 14:36, Gerhard Wiesinger wrote:
>> Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
>> CPU and never ends!
>>
>> Therefore I guess this is a problem either in the Linux kernel or in
>> QEMU/KVM.
> What kernel are you running on (and were you running on)?
>
> Can you try F20 host and F21 guest or vice versa?
Hello Paolo,
Always latest available stable versions (kernel/qemu-kvm):
F20: 3.17.6-200.fc20.x86_64 on guest/host, qemu-kvm-1.6.2-10.fc20.x86_64
on host
F21: 3.17.7-300.fc21.x86_64 on guest/host, qemu-kvm-2.1.2-7.fc21.x86_64
on host
(I had also 3.17.6-300.fc21.x86_64 and it happended there, too).
The topic is: it happens after some time (e.g. hours to days)
It is production environment running around 10VMs and I want to avoid
many experiments there. Since kernels were the same on FC20/F21 and
qemu/kvm changed from 1.6.2 to 2.1.2 I guess the topic seems to be
there. Also newer gcc might be a topic.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?id=refs/tags/v3.17.7#n695
I guess it hangs in the do/while loop here:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?id=refs/tags/v3.17.7#n493
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-08 18:12 ` Gerhard Wiesinger
@ 2015-01-08 18:22 ` Paolo Bonzini
2015-01-08 22:28 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 18:22 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel
On 08/01/2015 19:12, Gerhard Wiesinger wrote:
> Since kernels were the same on FC20/F21 and qemu/kvm changed from
> 1.6.2 to 2.1.2 I guess the topic seems to be there. Also newer gcc
> might be a topic.
Indeed. Can you try the 2.2.0 qemu-kvm release, available in the
fedora-virt-preview repository?
http://fedoraproject.org/wiki/Virtualization_Preview_Repository
Thanks,
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-08 18:22 ` Paolo Bonzini
@ 2015-01-08 22:28 ` Gerhard Wiesinger
2015-01-08 22:42 ` Paolo Bonzini
2015-01-12 11:41 ` Gerhard Wiesinger
0 siblings, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 22:28 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel
On 08.01.2015 19:22, Paolo Bonzini wrote:
>
> On 08/01/2015 19:12, Gerhard Wiesinger wrote:
>> Since kernels were the same on FC20/F21 and qemu/kvm changed from
>> 1.6.2 to 2.1.2 I guess the topic seems to be there. Also newer gcc
>> might be a topic.
> Indeed. Can you try the 2.2.0 qemu-kvm release, available in the
> fedora-virt-preview repository?
>
> http://fedoraproject.org/wiki/Virtualization_Preview_Repository
Updated to 2.2.0 qemu-kvm release, worked seemless so far for all VMs.
I'll keep you up to date in the next days whether it happens again or not.
BTW: Has something changed in the time code area between 1.6.2 and 2.1.2?
Thank you so far.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-08 22:28 ` Gerhard Wiesinger
@ 2015-01-08 22:42 ` Paolo Bonzini
2015-01-12 11:41 ` Gerhard Wiesinger
1 sibling, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 22:42 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel
On 08/01/2015 23:28, Gerhard Wiesinger wrote:
>>
>
> Updated to 2.2.0 qemu-kvm release, worked seemless so far for all VMs.
>
> I'll keep you up to date in the next days whether it happens again or not.
>
> BTW: Has something changed in the time code area between 1.6.2 and 2.1.2?
Most of this is actually in the kernel, but sometimes even
reading/writing the status out of the kernel can trigger bugs.
2.2 has more robust live migration of the kvm pv clock, but fixing that
also uncovered a few other bugs and it's possible they were latent in
2.1. Crossing fingers.
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-08 22:28 ` Gerhard Wiesinger
2015-01-08 22:42 ` Paolo Bonzini
@ 2015-01-12 11:41 ` Gerhard Wiesinger
2015-01-12 11:46 ` Paolo Bonzini
2015-01-13 20:13 ` Gerhard Wiesinger
1 sibling, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-12 11:41 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel
On 08.01.2015 23:28, Gerhard Wiesinger wrote:
> On 08.01.2015 19:22, Paolo Bonzini wrote:
>>
>> Indeed. Can you try the 2.2.0 qemu-kvm release, available in the
>> fedora-virt-preview repository?
>>
>> http://fedoraproject.org/wiki/Virtualization_Preview_Repository
>
> Updated to 2.2.0 qemu-kvm release, worked seemless so far for all VMs.
>
> I'll keep you up to date in the next days whether it happens again or
> not.
With qemu-kvm 2.2.0 release from the above repository the 100% usage
didn't happen so far (although I had to reboot after kernel update).
So it looks that qemu-kvm 2.1.x has major bugs regarding timer handling.
Any backporting planned?
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-12 11:41 ` Gerhard Wiesinger
@ 2015-01-12 11:46 ` Paolo Bonzini
2015-01-12 11:53 ` Gerhard Wiesinger
2015-01-13 20:13 ` Gerhard Wiesinger
1 sibling, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-12 11:46 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel
On 12/01/2015 12:41, Gerhard Wiesinger wrote:
>>
>> Updated to 2.2.0 qemu-kvm release, worked seemless so far for all VMs.
>>
>> I'll keep you up to date in the next days whether it happens again or
>> not.
>
> With qemu-kvm 2.2.0 release from the above repository the 100% usage
> didn't happen so far (although I had to reboot after kernel update).
>
> So it looks that qemu-kvm 2.1.x has major bugs regarding timer handling.
> Any backporting planned?
That's difficult without bisection pointing out where the bugs were
fixed. 2.1.3 is scheduled real soon now and it will be the last release
from the 2.1.x branch.
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-12 11:46 ` Paolo Bonzini
@ 2015-01-12 11:53 ` Gerhard Wiesinger
2015-01-12 11:54 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-12 11:53 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel
On 12.01.2015 12:46, Paolo Bonzini wrote:
>
> On 12/01/2015 12:41, Gerhard Wiesinger wrote:
>>> Updated to 2.2.0 qemu-kvm release, worked seemless so far for all VMs.
>>>
>>> I'll keep you up to date in the next days whether it happens again or
>>> not.
>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>> didn't happen so far (although I had to reboot after kernel update).
>>
>> So it looks that qemu-kvm 2.1.x has major bugs regarding timer handling.
>> Any backporting planned?
> That's difficult without bisection pointing out where the bugs were
> fixed. 2.1.3 is scheduled real soon now and it will be the last release
> from the 2.1.x branch.
So no security updates planned for 2.1.x afterwards?
For Fedora: Update for 2.2.x for the regulary update repo planned?
Will updated for 2.2.x also be released in the preview repos?
http://fedoraproject.org/wiki/Virtualization_Preview_Repository
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-12 11:53 ` Gerhard Wiesinger
@ 2015-01-12 11:54 ` Paolo Bonzini
0 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-12 11:54 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel
On 12/01/2015 12:53, Gerhard Wiesinger wrote:
>>>
>> That's difficult without bisection pointing out where the bugs were
>> fixed. 2.1.3 is scheduled real soon now and it will be the last release
>> from the 2.1.x branch.
>
> So no security updates planned for 2.1.x afterwards?
Distros usually take care of them, except for really egregious
vulnerabilities.
Paolo
> For Fedora: Update for 2.2.x for the regulary update repo planned?
> Will updated for 2.2.x also be released in the preview repos?
> http://fedoraproject.org/wiki/Virtualization_Preview_Repository
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-12 11:41 ` Gerhard Wiesinger
2015-01-12 11:46 ` Paolo Bonzini
@ 2015-01-13 20:13 ` Gerhard Wiesinger
2015-01-13 20:48 ` Paolo Bonzini
2015-01-14 17:47 ` Juan Quintela
1 sibling, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 20:13 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel
On 12.01.2015 12:41, Gerhard Wiesinger wrote:
> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>
>> I'll keep you up to date in the next days whether it happens again or
>> not.
>
> With qemu-kvm 2.2.0 release from the above repository the 100% usage
> didn't happen so far (although I had to reboot after kernel update).
It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
is running:
(gdb) bt
#0 0x00007fff9a1feff4 in gettimeofday ()
#1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
What we know:
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
qemu-kvm-2.2.0-1.fc21.x86_64 on host
No one less can reproduce or has similar problems?
Any further ideas?
BTW: I'm running ntp in the following manner: internet <=> ntp server in
VM <=> ntp client on KVM host (firewall runs in KVM)
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 20:13 ` Gerhard Wiesinger
@ 2015-01-13 20:48 ` Paolo Bonzini
2015-01-13 21:08 ` Cole Robinson
2015-01-13 21:14 ` Gerhard Wiesinger
2015-01-14 17:47 ` Juan Quintela
1 sibling, 2 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-13 20:48 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel, Cole Robinson, virt
On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>
> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
> is running:
> (gdb) bt
> #0 0x00007fff9a1feff4 in gettimeofday ()
> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>
> What we know:
> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
> qemu-kvm-1.6.2-10.fc20.x86_64 on host
> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.1.2-7.fc21.x86_64 on host
> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>
> No one less can reproduce or has similar problems?
> Any further ideas?
Hmm, too bad. :(
Any chance you can try with 1.7 and 2.0 releases? Cole, perhaps you
help building some RPMS for Gerhard?
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 20:48 ` Paolo Bonzini
@ 2015-01-13 21:08 ` Cole Robinson
2015-01-13 21:14 ` Gerhard Wiesinger
1 sibling, 0 replies; 35+ messages in thread
From: Cole Robinson @ 2015-01-13 21:08 UTC (permalink / raw)
To: Paolo Bonzini, Gerhard Wiesinger, qemu-devel, virt
On 01/13/2015 03:48 PM, Paolo Bonzini wrote:
>
>
> On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>>
>> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
>> is running:
>> (gdb) bt
>> #0 0x00007fff9a1feff4 in gettimeofday ()
>> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one less can reproduce or has similar problems?
>> Any further ideas?
>
> Hmm, too bad. :(
>
> Any chance you can try with 1.7 and 2.0 releases? Cole, perhaps you
> help building some RPMS for Gerhard?
>
> Paolo
>
There's F21 qemu-2.0 builds here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=538383
And 1.7 builds here: http://koji.fedoraproject.org/koji/buildinfo?buildID=499026
There might be dependency issues trying to install on latest f21, if they were
built against older libraries. Worth a shot though. Besides that something
custom will need to be built, though at that point it's probably easier to
build qemu from source
- Cole
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 20:48 ` Paolo Bonzini
2015-01-13 21:08 ` Cole Robinson
@ 2015-01-13 21:14 ` Gerhard Wiesinger
2015-01-13 21:16 ` Paolo Bonzini
1 sibling, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 21:14 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel, Cole Robinson, virt
On 13.01.2015 21:48, Paolo Bonzini wrote:
>
> On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
>> is running:
>> (gdb) bt
>> #0 0x00007fff9a1feff4 in gettimeofday ()
>> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one less can reproduce or has similar problems?
>> Any further ideas?
> Hmm, too bad. :(
>
> Any chance you can try with 1.7 and 2.0 releases? Cole, perhaps you
> help building some RPMS for Gerhard?
>
Yes, I can try. RPMs would be fine.
Don't know if it is related, but on another VM maschine I was getting
the following 75s after reboot:
[ 78.857006] INFO: rcu_sched self-detected stall on CPU { 0} (t=60000
jiffies g=1966 c=1965 q=0)
[ 258.860006] INFO: rcu_sched self-detected stall on CPU { 0} (t=240003
jiffies g=1966 c=1965 q=0)
I also had a look at the kernel code again:
http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
499 do {
500 seq = read_seqcount_begin(&tk_core.seq);
501
502 ts->tv_sec = tk->xtime_sec;
503 nsecs = timekeeping_get_ns(&tk->tkr);
504
505 } while (read_seqcount_retry(&tk_core.seq, seq));
So it looks like that the seqcount always changes and therefore loops
forever here (as far as I digged it down this is the only loop here).
Might be something wrong with the memory barriers in recent qemu-kvm
releases?
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 21:14 ` Gerhard Wiesinger
@ 2015-01-13 21:16 ` Paolo Bonzini
2015-01-13 22:01 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-13 21:16 UTC (permalink / raw)
To: Gerhard Wiesinger, qemu-devel, Cole Robinson, virt
On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>
> I also had a look at the kernel code again:
> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
> 499 do {
> 500 seq = read_seqcount_begin(&tk_core.seq);
> 501
> 502 ts->tv_sec = tk->xtime_sec;
> 503 nsecs = timekeeping_get_ns(&tk->tkr);
> 504
> 505 } while (read_seqcount_retry(&tk_core.seq, seq));
>
> So it looks like that the seqcount always changes and therefore loops
> forever here (as far as I digged it down this is the only loop here).
>
> Might be something wrong with the memory barriers in recent qemu-kvm
> releases?
No, that's not possible. Unless you pause/resume or migrate the VM, all
of the handling of kvmclock is entirely in the kernel.
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 21:16 ` Paolo Bonzini
@ 2015-01-13 22:01 ` Gerhard Wiesinger
2015-01-14 0:59 ` Laine Stump
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 22:01 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel, Cole Robinson, virt
On 13.01.2015 22:16, Paolo Bonzini wrote:
>
> On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>> I also had a look at the kernel code again:
>> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
>> 499 do {
>> 500 seq = read_seqcount_begin(&tk_core.seq);
>> 501
>> 502 ts->tv_sec = tk->xtime_sec;
>> 503 nsecs = timekeeping_get_ns(&tk->tkr);
>> 504
>> 505 } while (read_seqcount_retry(&tk_core.seq, seq));
>>
>> So it looks like that the seqcount always changes and therefore loops
>> forever here (as far as I digged it down this is the only loop here).
>>
>> Might be something wrong with the memory barriers in recent qemu-kvm
>> releases?
> No, that's not possible. Unless you pause/resume or migrate the VM, all
> of the handling of kvmclock is entirely in the kernel.
Any other possible explaination of the problem?
Had a look at the diff (I guess the right file at least in qemu tree):
# no critical changes IHMO here
git diff -u v1.6.2..v2.1.2 ./hw/i386/kvm/clock.c
Trying to reproduce with a loop:
#include <sys/time.h>
#include <stdio.h>
int main(int argc, char* argv[])
{
struct timeval tv;
int i = 0;
for (;;)
{
gettimeofday(&tv, 0);
++i;
if (i >= 10000000)
{
i = 0;
printf("%i\n", (int)tv.tv_sec);
}
}
return 0;
}
As I wrote this: "First tests seem to run well, so no quick win ....", I
could reproduce it with a stall in 318s :-)
(gdb) bt
#0 0x00007fff6d9fefff in gettimeofday ()
#1 0x00000000004005ad in main (argc=1, argv=0x7fff6d9b28b8) at
gettimeofdayloop.c:10
So we have at least a testcase which is quickly to reproduce.
So we are digging down my second findings about a major bug in qemu-kvm :-)
Can someone try, too?
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 22:01 ` Gerhard Wiesinger
@ 2015-01-14 0:59 ` Laine Stump
2015-01-14 9:15 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Laine Stump @ 2015-01-14 0:59 UTC (permalink / raw)
To: Gerhard Wiesinger, Paolo Bonzini, qemu-devel, Cole Robinson, virt
On 01/13/2015 05:01 PM, Gerhard Wiesinger wrote:
> On 13.01.2015 22:16, Paolo Bonzini wrote:
>>
>> On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>>> I also had a look at the kernel code again:
>>> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
>>>
>>> 499 do {
>>> 500 seq = read_seqcount_begin(&tk_core.seq);
>>> 501
>>> 502 ts->tv_sec = tk->xtime_sec;
>>> 503 nsecs = timekeeping_get_ns(&tk->tkr);
>>> 504
>>> 505 } while (read_seqcount_retry(&tk_core.seq, seq));
>>>
>>> So it looks like that the seqcount always changes and therefore loops
>>> forever here (as far as I digged it down this is the only loop here).
>>>
>>> Might be something wrong with the memory barriers in recent qemu-kvm
>>> releases?
>> No, that's not possible. Unless you pause/resume or migrate the VM, all
>> of the handling of kvmclock is entirely in the kernel.
>
> Any other possible explaination of the problem?
>
> Had a look at the diff (I guess the right file at least in qemu tree):
> # no critical changes IHMO here
> git diff -u v1.6.2..v2.1.2 ./hw/i386/kvm/clock.c
>
> Trying to reproduce with a loop:
> #include <sys/time.h>
> #include <stdio.h>
>
> int main(int argc, char* argv[])
> {
> struct timeval tv;
> int i = 0;
> for (;;)
> {
> gettimeofday(&tv, 0);
> ++i;
> if (i >= 10000000)
> {
> i = 0;
> printf("%i\n", (int)tv.tv_sec);
> }
> }
> return 0;
> }
>
> As I wrote this: "First tests seem to run well, so no quick win ....",
> I could reproduce it with a stall in 318s :-)
> (gdb) bt
> #0 0x00007fff6d9fefff in gettimeofday ()
> #1 0x00000000004005ad in main (argc=1, argv=0x7fff6d9b28b8) at
> gettimeofdayloop.c:10
>
> So we have at least a testcase which is quickly to reproduce.
>
> So we are digging down my second findings about a major bug in
> qemu-kvm :-)
>
> Can someone try, too?
>
> Ciao,
> Gerhard
>
>
Take a look at the following kernel bug. It specifically deals with a
hang in gettimeofday() in a KVM guest:
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
There is a link to a patched kernel you can try; it fixed my problems (I
was repeatedly getting hangs in python-urlgrabber during yum updates on
F21).
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-14 0:59 ` Laine Stump
@ 2015-01-14 9:15 ` Gerhard Wiesinger
2015-02-15 8:18 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-14 9:15 UTC (permalink / raw)
To: Laine Stump, Paolo Bonzini, qemu-devel, Cole Robinson, virt
On 14.01.2015 01:59, Laine Stump wrote:
>
> Take a look at the following kernel bug. It specifically deals with a
> hang in gettimeofday() in a KVM guest:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>
> There is a link to a patched kernel you can try; it fixed my problems (I
> was repeatedly getting hangs in python-urlgrabber during yum updates on
> F21).
>
Looks to be fixed, commented in:
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
Installed kernels.
http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247
Time to release a new official kernel :-)
Thanx for the comments.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-13 20:13 ` Gerhard Wiesinger
2015-01-13 20:48 ` Paolo Bonzini
@ 2015-01-14 17:47 ` Juan Quintela
2015-01-14 17:52 ` Juan Quintela
1 sibling, 1 reply; 35+ messages in thread
From: Juan Quintela @ 2015-01-14 17:47 UTC (permalink / raw)
To: Gerhard Wiesinger; +Cc: Paolo Bonzini, qemu-devel
Gerhard Wiesinger <lists@wiesinger.com> wrote:
> On 12.01.2015 12:41, Gerhard Wiesinger wrote:
>> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>>
>>> I'll keep you up to date in the next days whether it happens again
>>> or not.
>>
>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>> didn't happen so far (although I had to reboot after kernel update).
>
> It happens also with qemu-kvm 2.2.0 on another VM where also
> PostgreSQL is running:
> (gdb) bt
> #0 0x00007fff9a1feff4 in gettimeofday ()
> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>
> What we know:
> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
> qemu-kvm-1.6.2-10.fc20.x86_64 on host
> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.1.2-7.fc21.x86_64 on host
> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>
> No one less can reproduce or has similar problems?
I have similar problems, F21 guest on F21 host.
https://bugzilla.redhat.com/show_bug.cgi?id=1174664
For me, it is firefox that from time to time loops on vclock_gettime().
At that point the DSO have got corrupted (kvmclock msr's), and as you
say, only solution is reboot. Could you look at the bug, and see if the
areas pointed by the MSR also got corrupted? I tried using kernel-debug
with debugging for memory allocation (marcelo request). I see pattens
like 0x5a5a5a5a, so clearly there is some corruption, but haven't been
able to find _what_ is making that happen.
> Any further ideas?
Nope :-(
> BTW: I'm running ntp in the following manner: internet <=> ntp server
> in VM <=> ntp client on KVM host (firewall runs in KVM)
I am uisng ntp on the host against Internet, but the guest don't use ntp
(I do a ntpdate if I see that the guest has drifted too much, and it is
a rare event. Right now I don't rememeber having fixed it lately).
Later, Juan.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-14 17:47 ` Juan Quintela
@ 2015-01-14 17:52 ` Juan Quintela
2015-01-14 22:19 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Juan Quintela @ 2015-01-14 17:52 UTC (permalink / raw)
To: Gerhard Wiesinger; +Cc: Paolo Bonzini, qemu-devel
Juan Quintela <quintela@redhat.com> wrote:
> Gerhard Wiesinger <lists@wiesinger.com> wrote:
>> On 12.01.2015 12:41, Gerhard Wiesinger wrote:
>>> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>>>
>>>> I'll keep you up to date in the next days whether it happens again
>>>> or not.
>>>
>>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>>> didn't happen so far (although I had to reboot after kernel update).
>>
>> It happens also with qemu-kvm 2.2.0 on another VM where also
>> PostgreSQL is running:
>> (gdb) bt
>> #0 0x00007fff9a1feff4 in gettimeofday ()
>> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one less can reproduce or has similar problems?
>
> I have similar problems, F21 guest on F21 host.
> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
>
> For me, it is firefox that from time to time loops on vclock_gettime().
> At that point the DSO have got corrupted (kvmclock msr's), and as you
> say, only solution is reboot. Could you look at the bug, and see if the
> areas pointed by the MSR also got corrupted? I tried using kernel-debug
> with debugging for memory allocation (marcelo request). I see pattens
> like 0x5a5a5a5a, so clearly there is some corruption, but haven't been
> able to find _what_ is making that happen.
>
>> Any further ideas?
>
> Nope :-(
>
>> BTW: I'm running ntp in the following manner: internet <=> ntp server
>> in VM <=> ntp client on KVM host (firewall runs in KVM)
>
> I am uisng ntp on the host against Internet, but the guest don't use ntp
> (I do a ntpdate if I see that the guest has drifted too much, and it is
> a rare event. Right now I don't rememeber having fixed it lately).
I forgot tell on the previous patch, I am using 2vcpus. with a single
vcpu I have been unable to trigger this bug.
Later, Juan.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-14 17:52 ` Juan Quintela
@ 2015-01-14 22:19 ` Gerhard Wiesinger
0 siblings, 0 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-14 22:19 UTC (permalink / raw)
To: quintela; +Cc: Paolo Bonzini, qemu-devel
On 14.01.2015 18:52, Juan Quintela wrote:
> Juan Quintela <quintela@redhat.com> wrote:
> I forgot tell on the previous patch, I am using 2vcpus. with a single
> vcpu I have been unable to trigger this bug.
There is already a fix with a new patched kernel available for the
guest, see the bugzilla entry and my posts in this thread.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-01-14 9:15 ` Gerhard Wiesinger
@ 2015-02-15 8:18 ` Gerhard Wiesinger
2015-02-16 14:18 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-02-15 8:18 UTC (permalink / raw)
To: Laine Stump, Paolo Bonzini, qemu-devel, Cole Robinson, virt
On 14.01.2015 10:15, Gerhard Wiesinger wrote:
> On 14.01.2015 01:59, Laine Stump wrote:
>>
>> Take a look at the following kernel bug. It specifically deals with a
>> hang in gettimeofday() in a KVM guest:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>>
>> There is a link to a patched kernel you can try; it fixed my problems (I
>> was repeatedly getting hangs in python-urlgrabber during yum updates on
>> F21).
>>
>
> Looks to be fixed, commented in:
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
> Installed kernels.
> http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247
>
There seems to be another problem. My PostgreSQL server VM still crashes
(2 times).
1.) First time crash: no reaction at all in KVM console, no network access
2.) Second time "crash":
- KVM console 2 characters could be entered, then console was dead forever
- ping works
- ssh not
- host: 100% cpu on 2 cores
strace -y -p on kvm host (loops through this):
read(7<anon_inode:[eventfd]>, 0x7fffb862bf20, 16) = -1 EAGAIN (Resource
temporarily unavailable)
ioctl(11<anon_inode:kvm-vm>, KVM_GET_DIRTY_LOG, 0x7fffb862bd30) = 0
ioctl(11<anon_inode:kvm-vm>, KVM_GET_DIRTY_LOG, 0x7fffb862bd30) = 0
write(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP},
{fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=66<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71,
{0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614})
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
qemu-kvm-2.2.0-5.fc21.x86_64
Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
is fixed (didn't occour with the test program posted at
https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min,
happened before reproduceable in 2min, still running)
So I guess there is another problem in the kernel with volatile and gcc
optimizations (or maybe in qemu-KVM)
Any ideas?
Any other bug reports like this?
Thank you.
Ciao,
Gerhard
-- http://www.wiesinger.com/
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-02-15 8:18 ` Gerhard Wiesinger
@ 2015-02-16 14:18 ` Paolo Bonzini
2015-02-16 15:09 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-02-16 14:18 UTC (permalink / raw)
To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt
On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>
> ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=66<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
> {fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71,
> {0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614})
> write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
> read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8
> write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
Can you grab some random backtraces ("thread apply all bt full") with gdb?
What is the libvirt XML or qemu command line?
Paolo
> Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
> qemu-kvm-2.2.0-5.fc21.x86_64
>
> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
> is fixed (didn't occour with the test program posted at
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min,
> happened before reproduceable in 2min, still running)
>
> So I guess there is another problem in the kernel with volatile and gcc
> optimizations (or maybe in qemu-KVM)
No, this doesn't look like volatile.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-02-16 14:18 ` Paolo Bonzini
@ 2015-02-16 15:09 ` Gerhard Wiesinger
2015-02-16 15:29 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-02-16 15:09 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 16.02.2015 15:18, Paolo Bonzini wrote:
>
> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>
>> Can you grab some random backtraces ("thread apply all bt full") with gdb?
Very low load on the machine, so I guess most will be sleeping and hard
to catch non sleeping ones. See below
For the records: gdb -p 14139 -ex 'thread apply all bt full' -batch
>>
>> What is the libvirt XML or qemu command line?
See below.
>> Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
>> qemu-kvm-2.2.0-5.fc21.x86_64
>>
>> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>> is fixed (didn't occour with the test program posted at
>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min,
>> happened before reproduceable in 2min, still running)
>>
>> So I guess there is another problem in the kernel with volatile and gcc
>> optimizations (or maybe in qemu-KVM)
> No, this doesn't look like volatile.
But why 100% on 2 cores?
There were also recent bugfixes with optimizations in gcc-4.9.2-6.fc21
maybe we hit one of these.
- update from the 4.9 branch
- PRs c++/54442, c++/64514, c++/64521, c++/64901, c/57653, c/61553,
c/64766,
c/64778, c/64824, c/64868, debug/64511, debug/64663, fortran/56867,
fortran/57023, fortran/60922, fortran/62044, fortran/63733,
fortran/64230, fortran/64528, fortran/64771, ipa/63970, ipa/64068,
ipa/64559, libstdc++/64476, libstdc++/64584, libstdc++/64585,
libstdc++/64646, libstdc++/64649, libstdc++/64680, middle-end/63704,
middle-end/64391, middle-end/64421, middle-end/64734,
rtl-optimization/61058, rtl-optimization/63637,
rtl-optimization/64286, rtl-optimization/64557, target/61413,
target/63424, target/64358, target/64479, target/64505, target/64513,
target/64580, target/64795, target/64882, target/64938, target/64979,
testsuite/64712, tree-optimization/64563
Ciao,
Gerhard
/usr/bin/qemu-system-x86_64 -machine accel=kvm -name myvm -S -machine
pc-0.15,accel=kvm,usb=off -m 384 -realtime mlock=off -smp
2,sockets=2,cores=1,threads=1 -uuid b40e77d3-cd86-4d59-9ee4-5756ec88bf99
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/myvm.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot menu=on,strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
lsi,id=scsi0,bus=pci.0,addr=0x6 -device ahci,id=ahci0,bus=pci.0,addr=0x9
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
if=none,id=drive-ide0-1-0,readonly=on,format=raw -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive
file=/var/lib/libvirt/images/myvm.img,if=none,id=drive-virtio-disk0,format=qcow2
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=34 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:aa:bb:cc,bus=pci.0,addr=0x8
-netdev tap,fd=35,id=hostnet1,vhost=on,vhostfd=36 -device
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:aa:bb:cd,bus=pci.0,addr=0x7
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0
-vnc 127.0.0.1:9 -k de -device
VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
<domain type='kvm' id='12'>
<name>myvm</name>
<uuid>b40e77d3-cd86-4d59-9ee4-5756ec88bf99</uuid>
<description>Fedora 21</description>
<memory unit='KiB'>393216</memory>
<currentMemory unit='KiB'>393216</currentMemory>
<vcpu placement='static'>2</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='x86_64' machine='pc-0.15'>hvm</type>
<bootmenu enable='yes'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/bin/qemu-kvm</emulator>
<disk type='block' device='cdrom'>
<driver name='qemu' type='raw'/>
<backingStore/>
<target dev='hdc' bus='ide'/>
<readonly/>
<alias name='ide0-1-0'/>
<address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/myvm.img'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
<boot order='1'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03'
function='0x0'/>
</disk>
<controller type='ide' index='0'>
<alias name='ide0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01'
function='0x1'/>
</controller>
<controller type='usb' index='0'>
<alias name='usb0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01'
function='0x2'/>
</controller>
<controller type='virtio-serial' index='0'>
<alias name='virtio-serial0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04'
function='0x0'/>
</controller>
<controller type='scsi' index='0'>
<alias name='scsi0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06'
function='0x0'/>
</controller>
<controller type='sata' index='0'>
<alias name='sata0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x09'
function='0x0'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<alias name='pci.0'/>
</controller>
<interface type='network'>
<mac address='52:54:00:aa:bb:cc'/>
<source network='DMZ' bridge='virbr2'/>
<target dev='vnet16'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x08'
function='0x0'/>
</interface>
<interface type='bridge'>
<mac address='52:54:00:aa:bb:cd'/>
<source bridge='br0'/>
<target dev='vnet17'/>
<model type='virtio'/>
<alias name='net1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07'
function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/9'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/9'>
<source path='/dev/pts/9'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<graphics type='vnc' port='5909' autoport='yes' listen='127.0.0.1'
keymap='de'>
<listen type='address' address='127.0.0.1'/>
</graphics>
<video>
<model type='vga' vram='16384' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02'
function='0x0'/>
</video>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05'
function='0x0'/>
</memballoon>
</devices>
</domain>
Thread 11 (Thread 0x7ffe8bdfe700 (LWP 14147)):
#0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1 0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2 0x00007ffea8c37cec in kvm_cpu_exec ()
#3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 10 (Thread 0x7ffe8b5fd700 (LWP 14148)):
#0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1 0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2 0x00007ffea8c37cec in kvm_cpu_exec ()
#3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 9 (Thread 0x7ffe893ff700 (LWP 14158)):
#0 0x00007ffea771e590 in pthread_cond_wait@@GLIBC_2.3.2 () at
/lib64/libpthread.so.0
#1 0x00007ffea8eccd79 in qemu_cond_wait ()
#2 0x00007ffea8e50dd3 in vnc_worker_thread_loop ()
#3 0x00007ffea8e511b8 in vnc_worker_thread ()
#4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 8 (Thread 0x7ffe4effd700 (LWP 18334)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 7 (Thread 0x7ffe8231f700 (LWP 18335)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 6 (Thread 0x7ffe908c7700 (LWP 22620)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 5 (Thread 0x7ffe2d7fa700 (LWP 12264)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 4 (Thread 0x7ffe83622700 (LWP 17892)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 3 (Thread 0x7ffe4f7fe700 (LWP 17893)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 2 (Thread 0x7ffe82b20700 (LWP 17894)):
---Type <return> to continue, or q <return> to quit---
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7ffea8aa5a80 (LWP 14139)):
#0 0x00007ffe9dca72c1 in ppoll () at /lib64/libc.so.6
#1 0x00007ffea8e6289c in qemu_poll_ns ()
#2 0x00007ffea8e62034 in main_loop_wait ()
#3 0x00007ffea8bfdcdd in main ()
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007ffe9dca72c1 in ppoll () from /lib64/libc.so.6
Thread 8 (Thread 0x7ffe8bdfe700 (LWP 14147)):
#0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1 0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2 0x00007ffea8c37cec in kvm_cpu_exec ()
#3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 7 (Thread 0x7ffe8b5fd700 (LWP 14148)):
#0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6
#1 0x00007ffea8c37c35 in kvm_vcpu_ioctl ()
#2 0x00007ffea8c37cec in kvm_cpu_exec ()
#3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 6 (Thread 0x7ffe893ff700 (LWP 14158)):
#0 0x00007ffea771e590 in pthread_cond_wait@@GLIBC_2.3.2 () at
/lib64/libpthread.so.0
#1 0x00007ffea8eccd79 in qemu_cond_wait ()
#2 0x00007ffea8e50dd3 in vnc_worker_thread_loop ()
#3 0x00007ffea8e511b8 in vnc_worker_thread ()
#4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 5 (Thread 0x7ffdeeffd700 (LWP 31356)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 4 (Thread 0x7ffdef7fe700 (LWP 32680)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 3 (Thread 0x7ffe0ffff700 (LWP 32681)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 2 (Thread 0x7ffe8131d700 (LWP 32682)):
#0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0
#1 0x00007ffea8eccec7 in qemu_sem_timedwait ()
#2 0x00007ffea8e5523c in worker_thread ()
#3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0
#4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7ffea8aa5a80 (LWP 14139)):
#0 0x00007ffe9dca72c1 in ppoll () at /lib64/libc.so.6
#1 0x00007ffea8e6289c in qemu_poll_ns ()
#2 0x00007ffea8e62034 in main_loop_wait ()
#3 0x00007ffea8bfdcdd in main ()
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-02-16 15:09 ` Gerhard Wiesinger
@ 2015-02-16 15:29 ` Paolo Bonzini
2015-03-01 10:36 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-02-16 15:29 UTC (permalink / raw)
To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt
On 16/02/2015 16:09, Gerhard Wiesinger wrote:
> On 16.02.2015 15:18, Paolo Bonzini wrote:
>>
>> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>>
>>> Can you grab some random backtraces ("thread apply all bt full") with
>>> gdb?
>
> Very low load on the machine, so I guess most will be sleeping and hard
> to catch non sleeping ones. See below
This likely means that the 100% threads are not the I/O (event loop)
threads, but the VCPU threads.
>>> Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
>>> qemu-kvm-2.2.0-5.fc21.x86_64
>>>
>>> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>>> is fixed (didn't occour with the test program posted at
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min,
>>> happened before reproduceable in 2min, still running)
>>>
>>> So I guess there is another problem in the kernel with volatile and gcc
>>> optimizations (or maybe in qemu-KVM)
>> No, this doesn't look like volatile.
>
> But why 100% on 2 cores?
Because you have 2 VCPUs, I guess.
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-02-16 15:29 ` Paolo Bonzini
@ 2015-03-01 10:36 ` Gerhard Wiesinger
2015-03-02 9:26 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-01 10:36 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 16.02.2015 16:29, Paolo Bonzini wrote:
>
> On 16/02/2015 16:09, Gerhard Wiesinger wrote:
>> On 16.02.2015 15:18, Paolo Bonzini wrote:
>>> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>>> Can you grab some random backtraces ("thread apply all bt full") with
>>>> gdb?
>> Very low load on the machine, so I guess most will be sleeping and hard
>> to catch non sleeping ones. See below
> This likely means that the 100% threads are not the I/O (event loop)
> threads, but the VCPU threads.
ok, happened again, details below.
Any further ideas from the stack traces?
So far it happened only the PostgreSQL database VM. Kernel is alive
(ping works well). ssh is not working.
console window: after entering one character at login prompt, then crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
anon-rss: 162712kB, file-rss: 220kB
VM uptime is ~1 day, 2 cores 100%CPU.
VM is very stripped down, nevertheless is should have enough memory (and
also swap). Looks like it crashed at the nighlty backup (pg_dumpall),
command is: ssh -x ${REMOTE_USER}@${REMOTE_HOST} "pg_dumpall" | bzip2 -9
> ${DEST_SQL}
free
total used free shared buff/cache
available
Mem: 372264 28536 214780 20884 128948 303360
Swap: 1081340 0 1081340
Ciao,
Gerhard
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007feee62412c1 in ppoll () from /lib64/libc.so.6
Thread 4 (Thread 0x7feed3fff700 (LWP 4636)):
#0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2 0x00007feef11d1cec in kvm_cpu_exec ()
#3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6
Thread 3 (Thread 0x7feed37fe700 (LWP 4637)):
#0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2 0x00007feef11d1cec in kvm_cpu_exec ()
#3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6
Thread 2 (Thread 0x7feed1bff700 (LWP 4653)):
#0 0x00007feeefcb8590 in pthread_cond_wait@@GLIBC_2.3.2 () at
/lib64/libpthread.so.0
#1 0x00007feef1466d79 in qemu_cond_wait ()
#2 0x00007feef13eadd3 in vnc_worker_thread_loop ()
#3 0x00007feef13eb1b8 in vnc_worker_thread ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7feef103fa80 (LWP 4563)):
#0 0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
#1 0x00007feef13fc89c in qemu_poll_ns ()
#2 0x00007feef13fc034 in main_loop_wait ()
#3 0x00007feef1197cdd in main ()
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-01 10:36 ` Gerhard Wiesinger
@ 2015-03-02 9:26 ` Paolo Bonzini
2015-03-02 15:52 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-03-02 9:26 UTC (permalink / raw)
To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt
On 01/03/2015 11:36, Gerhard Wiesinger wrote:
> So far it happened only the PostgreSQL database VM. Kernel is alive
> (ping works well). ssh is not working.
> console window: after entering one character at login prompt, then crashed:
> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
> sacrifice child
> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
> anon-rss: 162712kB, file-rss: 220kB
Can you get a vmcore or at least sysrq-t output?
Paolo
> VM uptime is ~1 day, 2 cores 100%CPU.
> VM is very stripped down, nevertheless is should have enough memory (and
> also swap). Looks like it crashed at the nighlty backup (pg_dumpall),
> command is: ssh -x ${REMOTE_USER}@${REMOTE_HOST} "pg_dumpall" | bzip2 -9
>> ${DEST_SQL}
> free
> total used free shared buff/cache
> available
> Mem: 372264 28536 214780 20884 128948 303360
> Swap: 1081340 0 1081340
>
> Ciao,
> Gerhard
>
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007feee62412c1 in ppoll () from /lib64/libc.so.6
>
> Thread 4 (Thread 0x7feed3fff700 (LWP 4636)):
> #0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
> #1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
> #2 0x00007feef11d1cec in kvm_cpu_exec ()
> #3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
> #4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5 0x00007feee624c79d in clone () at /lib64/libc.so.6
>
> Thread 3 (Thread 0x7feed37fe700 (LWP 4637)):
> #0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
> #1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
> #2 0x00007feef11d1cec in kvm_cpu_exec ()
> #3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
> #4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5 0x00007feee624c79d in clone () at /lib64/libc.so.6
>
> Thread 2 (Thread 0x7feed1bff700 (LWP 4653)):
> #0 0x00007feeefcb8590 in pthread_cond_wait@@GLIBC_2.3.2 () at
> /lib64/libpthread.so.0
> #1 0x00007feef1466d79 in qemu_cond_wait ()
> #2 0x00007feef13eadd3 in vnc_worker_thread_loop ()
> #3 0x00007feef13eb1b8 in vnc_worker_thread ()
> #4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5 0x00007feee624c79d in clone () at /lib64/libc.so.6
>
> Thread 1 (Thread 0x7feef103fa80 (LWP 4563)):
> #0 0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
> #1 0x00007feef13fc89c in qemu_poll_ns ()
> #2 0x00007feef13fc034 in main_loop_wait ()
> #3 0x00007feef1197cdd in main ()
>
>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-02 9:26 ` Paolo Bonzini
@ 2015-03-02 15:52 ` Gerhard Wiesinger
2015-03-02 17:15 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-02 15:52 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 02.03.2015 10:26, Paolo Bonzini wrote:
>
> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>> So far it happened only the PostgreSQL database VM. Kernel is alive
>> (ping works well). ssh is not working.
>> console window: after entering one character at login prompt, then crashed:
>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>> sacrifice child
>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>> anon-rss: 162712kB, file-rss: 220kB
> Can you get a vmcore or at least sysrq-t output?
Yes, next time it happens I can analyze it.
I think there are 2 problems:
1.) OOM (Out of Memory) problem with the low memory settings and kernel
settings (see below)
2.) Instability problem which might have a dependency to 1.)
What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated maschine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-0.15'>hvm</type>
virsh edit database
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
SMBIOS is updated therefore from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[ 0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off
highres=off
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory balooning by
blacklisting baloon driver or remove from virsh xml config
Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem,
no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem,
no crash
So the OOM problem seems to be solved (at least it didn't happen so far)
by installing 512MB RAM and setting vm.overcommit_memory = 1 (I guess
just setting overcommit would be fine, too).
Instability didn't occour so far. If I can't reproduce it, I'll revert
the settings.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-02 15:52 ` Gerhard Wiesinger
@ 2015-03-02 17:15 ` Gerhard Wiesinger
2015-03-03 9:12 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-02 17:15 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 02.03.2015 16:52, Gerhard Wiesinger wrote:
> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>
>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>> (ping works well). ssh is not working.
>>> console window: after entering one character at login prompt, then
>>> crashed:
>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>>> sacrifice child
>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>> anon-rss: 162712kB, file-rss: 220kB
>> Can you get a vmcore or at least sysrq-t output?
>
> Yes, next time it happens I can analyze it.
>
> I think there are 2 problems:
> 1.) OOM (Out of Memory) problem with the low memory settings and
> kernel settings (see below)
> 2.) Instability problem which might have a dependency to 1.)
>
> What I've done so far (thanks to Andrey Korolyov for ideas and help):
> a.) Updated maschine type from pc-0.15 to pc-i440fx-2.2
> virsh dumpxml database | grep "<type"
> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>
> virsh edit database
> virsh dumpxml database | grep "<type"
> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>
> SMBIOS is updated therefore from 2.4 to 2.8:
> dmesg|grep -i SMBIOS
> [ 0.000000] SMBIOS 2.8 present.
> b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off
> highres=off
> c.) Changed overcommit to 1
> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
> d.) Tried 1 VCPU instead of 2
> e.) Installed 512MB vRAM instead of 384MB
> f.) Prepared for sysrq and vmcore
> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
> sysctl -w kernel.sysrq=1
> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
> virsh dump domain-name /tmp/dumpfile
> g.) Further ideas, not yet done: disable memory balooning by
> blacklisting baloon driver or remove from virsh xml config
>
> Summary:
> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem,
> no crash
> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem,
> no crash
3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem,
no crash
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-02 17:15 ` Gerhard Wiesinger
@ 2015-03-03 9:12 ` Gerhard Wiesinger
2015-03-03 10:12 ` Paolo Bonzini
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 9:12 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 02.03.2015 18:15, Gerhard Wiesinger wrote:
> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>
>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>> (ping works well). ssh is not working.
>>>> console window: after entering one character at login prompt, then
>>>> crashed:
>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>>>> sacrifice child
>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>> anon-rss: 162712kB, file-rss: 220kB
>>> Can you get a vmcore or at least sysrq-t output?
>>
>> Yes, next time it happens I can analyze it.
>>
>> I think there are 2 problems:
>> 1.) OOM (Out of Memory) problem with the low memory settings and
>> kernel settings (see below)
>> 2.) Instability problem which might have a dependency to 1.)
>>
>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>> a.) Updated maschine type from pc-0.15 to pc-i440fx-2.2
>> virsh dumpxml database | grep "<type"
>> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>
>> virsh edit database
>> virsh dumpxml database | grep "<type"
>> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>
>> SMBIOS is updated therefore from 2.4 to 2.8:
>> dmesg|grep -i SMBIOS
>> [ 0.000000] SMBIOS 2.8 present.
>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>> nohz=off highres=off
>> c.) Changed overcommit to 1
>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>> d.) Tried 1 VCPU instead of 2
>> e.) Installed 512MB vRAM instead of 384MB
>> f.) Prepared for sysrq and vmcore
>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>> sysctl -w kernel.sysrq=1
>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>> virsh dump domain-name /tmp/dumpfile
>> g.) Further ideas, not yet done: disable memory balooning by
>> blacklisting baloon driver or remove from virsh xml config
>>
>> Summary:
>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>> problem, no crash
>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>> problem, no crash
>
> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem,
> no crash
3b.) Still happened again at the nightly backup with same configuration
as in 3.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory =
1, pc-i440fx-2.2: no OOM problem, ping ok, no reaction, BUT CRASHED again
SYSRQ: no reaction of the VM
virsh send-key vm KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump vm file.core
error: Failed to core dump domain vm to file.core
error: internal error: unable to execute QEMU command 'migrate': State
blocked by non-migratable device '0000:00:09.0/ich9_ahci'
Removed the SATA controller, dump should work for the future.
Any futher ideas?
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-03 9:12 ` Gerhard Wiesinger
@ 2015-03-03 10:12 ` Paolo Bonzini
2015-03-03 11:51 ` Gonglei
2015-03-03 12:28 ` Gerhard Wiesinger
2 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-03-03 10:12 UTC (permalink / raw)
To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt
On 03/03/2015 10:12, Gerhard Wiesinger wrote:
>
> virsh dump vm file.core
> error: Failed to core dump domain vm to file.core
> error: internal error: unable to execute QEMU command 'migrate': State
> blocked by non-migratable device '0000:00:09.0/ich9_ahci'
>
> Removed the SATA controller, dump should work for the future.
Thanks.
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-03 9:12 ` Gerhard Wiesinger
2015-03-03 10:12 ` Paolo Bonzini
@ 2015-03-03 11:51 ` Gonglei
2015-03-03 12:28 ` Gerhard Wiesinger
2 siblings, 0 replies; 35+ messages in thread
From: Gonglei @ 2015-03-03 11:51 UTC (permalink / raw)
To: Gerhard Wiesinger, Paolo Bonzini, Laine Stump, qemu-devel,
Cole Robinson, virt
On 2015/3/3 17:12, Gerhard Wiesinger wrote:
> 3b.) Still happened again at the nightly backup with same configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no reaction, BUT CRASHED again
>
> SYSRQ: no reaction of the VM
> virsh send-key vm KEY_LEFTALT KEY_SYSRQ KEY_T
>
> virsh dump vm file.core
> error: Failed to core dump domain vm to file.core
> error: internal error: unable to execute QEMU command 'migrate': State blocked by non-migratable device '0000:00:09.0/ich9_ahci'
>
You can use '--memory-only' argument to dump vm's memory only. eg:
virsh dump vm file.core --memory-only
> Removed the SATA controller, dump should work for the future.
>
> Any futher ideas?
Regards,
-Gonglei
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-03 9:12 ` Gerhard Wiesinger
2015-03-03 10:12 ` Paolo Bonzini
2015-03-03 11:51 ` Gonglei
@ 2015-03-03 12:28 ` Gerhard Wiesinger
2015-03-03 13:18 ` Gerhard Wiesinger
2 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 12:28 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 03.03.2015 10:12, Gerhard Wiesinger wrote:
> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>
>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>>> (ping works well). ssh is not working.
>>>>> console window: after entering one character at login prompt, then
>>>>> crashed:
>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score
>>>>> 112 or
>>>>> sacrifice child
>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>> Can you get a vmcore or at least sysrq-t output?
>>>
>>> Yes, next time it happens I can analyze it.
>>>
>>> I think there are 2 problems:
>>> 1.) OOM (Out of Memory) problem with the low memory settings and
>>> kernel settings (see below)
>>> 2.) Instability problem which might have a dependency to 1.)
>>>
>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>> a.) Updated maschine type from pc-0.15 to pc-i440fx-2.2
>>> virsh dumpxml database | grep "<type"
>>> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>
>>> virsh edit database
>>> virsh dumpxml database | grep "<type"
>>> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>
>>> SMBIOS is updated therefore from 2.4 to 2.8:
>>> dmesg|grep -i SMBIOS
>>> [ 0.000000] SMBIOS 2.8 present.
>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>>> nohz=off highres=off
>>> c.) Changed overcommit to 1
>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>> d.) Tried 1 VCPU instead of 2
>>> e.) Installed 512MB vRAM instead of 384MB
>>> f.) Prepared for sysrq and vmcore
>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>> sysctl -w kernel.sysrq=1
>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>> virsh dump domain-name /tmp/dumpfile
>>> g.) Further ideas, not yet done: disable memory balooning by
>>> blacklisting baloon driver or remove from virsh xml config
>>>
>>> Summary:
>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>>> problem, no crash
>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>> problem, no crash
>>
>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>> problem, no crash
>
> 3b.) Still happened again at the nightly backup with same
> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU,
> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no
> reaction, BUT CRASHED again
>
3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
pc-i440fx-2.2: OOM problem, no crash
postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB,
file-rss:0kB
An OOM should not occour:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
Why does an OOM condition occour? Looks like a bug in the kernel?
Any ideas?
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-03 12:28 ` Gerhard Wiesinger
@ 2015-03-03 13:18 ` Gerhard Wiesinger
2015-03-03 20:50 ` Gerhard Wiesinger
0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 13:18 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 03.03.2015 13:28, Gerhard Wiesinger wrote:
> On 03.03.2015 10:12, Gerhard Wiesinger wrote:
>> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>>
>>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>>>> (ping works well). ssh is not working.
>>>>>> console window: after entering one character at login prompt,
>>>>>> then crashed:
>>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score
>>>>>> 112 or
>>>>>> sacrifice child
>>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>>> Can you get a vmcore or at least sysrq-t output?
>>>>
>>>> Yes, next time it happens I can analyze it.
>>>>
>>>> I think there are 2 problems:
>>>> 1.) OOM (Out of Memory) problem with the low memory settings and
>>>> kernel settings (see below)
>>>> 2.) Instability problem which might have a dependency to 1.)
>>>>
>>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>>> a.) Updated maschine type from pc-0.15 to pc-i440fx-2.2
>>>> virsh dumpxml database | grep "<type"
>>>> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>>
>>>> virsh edit database
>>>> virsh dumpxml database | grep "<type"
>>>> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>>
>>>> SMBIOS is updated therefore from 2.4 to 2.8:
>>>> dmesg|grep -i SMBIOS
>>>> [ 0.000000] SMBIOS 2.8 present.
>>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>>>> nohz=off highres=off
>>>> c.) Changed overcommit to 1
>>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>>> d.) Tried 1 VCPU instead of 2
>>>> e.) Installed 512MB vRAM instead of 384MB
>>>> f.) Prepared for sysrq and vmcore
>>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>>> sysctl -w kernel.sysrq=1
>>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>>> virsh dump domain-name /tmp/dumpfile
>>>> g.) Further ideas, not yet done: disable memory balooning by
>>>> blacklisting baloon driver or remove from virsh xml config
>>>>
>>>> Summary:
>>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>>>> problem, no crash
>>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>>> problem, no crash
>>>
>>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>> problem, no crash
>>
>> 3b.) Still happened again at the nightly backup with same
>> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU,
>> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no
>> reaction, BUT CRASHED again
>>
>
> 3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
> pc-i440fx-2.2: OOM problem, no crash
>
> postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> Free swap = 905924kB
> Total swap = 1081340kB
> Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
> Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB,
> file-rss:0kB
>
> An OOM should not occour:
> https://www.kernel.org/doc/gorman/html/understand/understand016.html
> Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
>
> Why does an OOM condition occour? Looks like a bug in the kernel?
> Any ideas?
# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed
Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB,
file-rss:0kB
free -m
total used free shared buff/cache
available
Mem: 363 23 252 23 87 295
Swap: 1055 134 921
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1392
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1392
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
# Maschine is getting inresponsive and stalls for seconds, but never
reaches more than 1055MB swap size (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system--
------cpu-----
r b swpd free buff cache si so bi bo in cs us sy
id wa st
0 0 136472 241196 1400 98544 4 57 1724 67 211 261 2 3
91 2 2
0 0 136472 241228 1400 98540 0 0 0 0 30 48 0 0
100 0 0
0 0 136472 241228 1408 98532 0 0 0 52 53 51 0 0
89 11 0
0 0 136472 241224 1408 98540 0 0 0 112 44 92 0 0
100 0 0
0 0 136472 241224 1408 98540 0 0 0 0 24 32 0 0
100 0 0
0 0 136472 241352 1408 98540 0 0 0 0 31 44 0 1
100 0 0
0 0 136472 241328 1408 98540 0 0 0 36 97 142 0 1
99 0 0
0 0 136472 241364 1408 98540 0 0 0 0 22 30 0 0
100 0 0
0 0 136472 241376 1416 98532 0 0 0 80 52 45 0 0
92 8 1
1 0 136472 9236 1416 98548 0 0 8 0 762 55 11 23
66 0 0
2 7 270496 3804 140 61172 1144 412268 15028 412340 92805
301836 1 49 1 27 22
1 12 620320 4788 140 35240 1240 114864 96860 114976 46242 96395
1 26 0 61 12
3 18 661436 4788 144 35568 508 0 167884 0 5605 8097 5
76 0 16 4
3 4 661220 4288 144 34256 252 0 273684 0 7454 9777 3
71 0 19 7
5 20 661024 4532 144 34772 320 0 238288 0 9452 12395 3
78 0 13 6
6 19 660596 4592 144 35884 320 0 233160 8 12401 16798 5
67 0 12 15
3 20 677268 4296 140 36816 2180 18200 444328 18332 19382 36234
8 67 0 11 14
3 25 677208 4792 136 36044 68 0 524340 12 20637 26558 3
74 0 15 8
2 21 687880 4964 136 38200 260 10784 311152 10884 17707 28941
4 78 0 12 5
3 21 693808 4380 176 36860 136 6024 388932 6096 14576 22372 3
84 0 6 7
3 27 693740 4432 152 38288 56 20736 419592 20744 23212 31219
4 87 0 7 2
3 23 713696 4384 152 38172 796 0 481420 96 16498 27177 8
87 0 4 1
3 27 713360 4116 152 38372 1844 0 1308552 296 25074 33901
5 85 0 9 1
3 29 714628 4416 180 41992 256 2556 501832 2704 56498 76293 3
91 0 5 1
3 29 714572 3860 172 41076 156 0 920736 152 12131 17339 5
94 0 0 0
4 28 714396 5108 152 40124 212 10924 567648 11148 41901 56712
4 90 0 4 2
3 30 725216 4060 136 40604 124 0 286384 156 21992 35505 5
91 0 2 3
8 12 148836 230388 320 70888 5356 0 24304 52 9977 15084 17
75 0 5 3
0 0 146692 271900 416 76680 2200 0 6592 0 1561 3198 10 10
78 2 1
0 0 146584 271900 416 76892 152 0 184 0 75 139 0 0
100 0 1
0 0 146488 271396 552 76980 128 0 264 36 124 230 0 1
98 1 0
0 0 146372 271076 680 77196 124 0 252 8 79 167 0 0
100 0 0
0 0 146312 270948 688 77332 64 0 64 80 61 102 0 0
97 3 1
What's wrong here?
Kernel Bug?
Ciao,
Gerhard
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
typedef unsigned int BOOL;
typedef char* PCHAR;
typedef unsigned int DWORD;
#define FALSE 0
#define TRUE 1
BOOL getlong(PCHAR s,DWORD* retvalue)
{
char *eptr;
long value;
value=strtol(s,&eptr,0);
if ((eptr==s)||(*eptr!='\0')) return FALSE;
if (value<0) return FALSE;
*retvalue=value;
return TRUE;
}
int main(int argc,char* argv[])
{
unsigned int* p;
unsigned int size=16*1024*1024;
unsigned int size_of=sizeof(unsigned int);
int i;
if (argc>1)
{
if (!getlong(argv[1],&size))
{
printf("Wrong memsize!\n");
exit(1);
}
}
p=malloc(size);
for(i=0;i<(size/size_of);i++) p[i]=0;
sleep(3600);
free(p);
return 0;
}
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
2015-03-03 13:18 ` Gerhard Wiesinger
@ 2015-03-03 20:50 ` Gerhard Wiesinger
0 siblings, 0 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-03-03 20:50 UTC (permalink / raw)
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt
On 03.03.2015 14:18, Gerhard Wiesinger wrote:
> On 03.03.2015 13:28, Gerhard Wiesinger wrote:
>> On 03.03.2015 10:12, Gerhard Wiesinger wrote:
>>> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>>>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>>>
>>>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>>>> So far it happened only the PostgreSQL database VM. Kernel is alive
>>>>>>> (ping works well). ssh is not working.
>>>>>>> console window: after entering one character at login prompt,
>>>>>>> then crashed:
>>>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score
>>>>>>> 112 or
>>>>>>> sacrifice child
>>>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>>>> Can you get a vmcore or at least sysrq-t output?
>>>>>
>>>>> Yes, next time it happens I can analyze it.
>>>>>
>>>>> I think there are 2 problems:
>>>>> 1.) OOM (Out of Memory) problem with the low memory settings and
>>>>> kernel settings (see below)
>>>>> 2.) Instability problem which might have a dependency to 1.)
>>>>>
>>>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>>>> a.) Updated maschine type from pc-0.15 to pc-i440fx-2.2
>>>>> virsh dumpxml database | grep "<type"
>>>>> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>>>
>>>>> virsh edit database
>>>>> virsh dumpxml database | grep "<type"
>>>>> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>>>
>>>>> SMBIOS is updated therefore from 2.4 to 2.8:
>>>>> dmesg|grep -i SMBIOS
>>>>> [ 0.000000] SMBIOS 2.8 present.
>>>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>>>>> nohz=off highres=off
>>>>> c.) Changed overcommit to 1
>>>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>>>> d.) Tried 1 VCPU instead of 2
>>>>> e.) Installed 512MB vRAM instead of 384MB
>>>>> f.) Prepared for sysrq and vmcore
>>>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>>>> sysctl -w kernel.sysrq=1
>>>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>>>> virsh dump domain-name /tmp/dumpfile
>>>>> g.) Further ideas, not yet done: disable memory balooning by
>>>>> blacklisting baloon driver or remove from virsh xml config
>>>>>
>>>>> Summary:
>>>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>>>>> problem, no crash
>>>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>>>> problem, no crash
>>>>
>>>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>>> problem, no crash
>>>
>>> 3b.) Still happened again at the nightly backup with same
>>> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU,
>>> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no
>>> reaction, BUT CRASHED again
>>>
>>
>> 3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
>> pc-i440fx-2.2: OOM problem, no crash
>>
>> postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>> Free swap = 905924kB
>> Total swap = 1081340kB
>> Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
>> Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB,
>> file-rss:0kB
>>
>> An OOM should not occour:
>> https://www.kernel.org/doc/gorman/html/understand/understand016.html
>> Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
>>
>> Why does an OOM condition occour? Looks like a bug in the kernel?
>> Any ideas?
>
> # Allocating 800MB, killed by OOM killer
> ./mallocsleep 805306368
> Killed
>
> Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice
> child
> Killed process 27160 (mallocsleep) total-vm:790588kB,
> anon-rss:214948kB, file-rss:0kB
>
> free -m
> total used free shared buff/cache
> available
> Mem: 363 23 252 23 87 295
> Swap: 1055 134 921
>
> ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 1392
> max locked memory (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) 8192
> cpu time (seconds, -t) unlimited
> max user processes (-u) 1392
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
>
> # Maschine is getting inresponsive and stalls for seconds, but never
> reaches more than 1055MB swap size (+ 384MB RAM)
> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system--
> ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy
> id wa st
> 0 0 136472 241196 1400 98544 4 57 1724 67 211 261 2 3
> 91 2 2
> 0 0 136472 241228 1400 98540 0 0 0 0 30 48 0 0
> 100 0 0
> 0 0 136472 241228 1408 98532 0 0 0 52 53 51 0 0
> 89 11 0
> 0 0 136472 241224 1408 98540 0 0 0 112 44 92 0 0
> 100 0 0
> 0 0 136472 241224 1408 98540 0 0 0 0 24 32 0 0
> 100 0 0
> 0 0 136472 241352 1408 98540 0 0 0 0 31 44 0 1
> 100 0 0
> 0 0 136472 241328 1408 98540 0 0 0 36 97 142 0 1
> 99 0 0
> 0 0 136472 241364 1408 98540 0 0 0 0 22 30 0 0
> 100 0 0
> 0 0 136472 241376 1416 98532 0 0 0 80 52 45 0 0
> 92 8 1
> 1 0 136472 9236 1416 98548 0 0 8 0 762 55 11 23
> 66 0 0
> 2 7 270496 3804 140 61172 1144 412268 15028 412340 92805
> 301836 1 49 1 27 22
> 1 12 620320 4788 140 35240 1240 114864 96860 114976 46242
> 96395 1 26 0 61 12
> 3 18 661436 4788 144 35568 508 0 167884 0 5605 8097 5
> 76 0 16 4
> 3 4 661220 4288 144 34256 252 0 273684 0 7454 9777 3
> 71 0 19 7
> 5 20 661024 4532 144 34772 320 0 238288 0 9452 12395 3
> 78 0 13 6
> 6 19 660596 4592 144 35884 320 0 233160 8 12401 16798
> 5 67 0 12 15
> 3 20 677268 4296 140 36816 2180 18200 444328 18332 19382 36234
> 8 67 0 11 14
> 3 25 677208 4792 136 36044 68 0 524340 12 20637 26558
> 3 74 0 15 8
> 2 21 687880 4964 136 38200 260 10784 311152 10884 17707 28941
> 4 78 0 12 5
> 3 21 693808 4380 176 36860 136 6024 388932 6096 14576 22372
> 3 84 0 6 7
> 3 27 693740 4432 152 38288 56 20736 419592 20744 23212 31219
> 4 87 0 7 2
> 3 23 713696 4384 152 38172 796 0 481420 96 16498 27177
> 8 87 0 4 1
> 3 27 713360 4116 152 38372 1844 0 1308552 296 25074 33901
> 5 85 0 9 1
> 3 29 714628 4416 180 41992 256 2556 501832 2704 56498 76293
> 3 91 0 5 1
> 3 29 714572 3860 172 41076 156 0 920736 152 12131 17339
> 5 94 0 0 0
> 4 28 714396 5108 152 40124 212 10924 567648 11148 41901 56712
> 4 90 0 4 2
> 3 30 725216 4060 136 40604 124 0 286384 156 21992 35505
> 5 91 0 2 3
> 8 12 148836 230388 320 70888 5356 0 24304 52 9977 15084 17
> 75 0 5 3
> 0 0 146692 271900 416 76680 2200 0 6592 0 1561 3198 10
> 10 78 2 1
> 0 0 146584 271900 416 76892 152 0 184 0 75 139 0 0
> 100 0 1
> 0 0 146488 271396 552 76980 128 0 264 36 124 230 0 1
> 98 1 0
> 0 0 146372 271076 680 77196 124 0 252 8 79 167 0 0
> 100 0 0
> 0 0 146312 270948 688 77332 64 0 64 80 61 102 0 0
> 97 3 1
>
> What's wrong here?
> Kernel Bug?
>
Reminds me all of the post here:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html
Last month, these outages began to happen more regularly but also very
randomly. The symptoms were quite similar:
CPU spiked to 100% utilization.
Disk I/O spiked.
Server became completely inaccessible via SSH, etc.
Logs show the Linux Out Of Memory (OOM) killer killing user
processes that have hit their cgroup's memory limit shortly before the
server froze.
Host memory was not under pressure - it was close to fully utilized
(which is normal) but there was a lot of unused swap.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2015-03-03 20:51 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-08 13:36 [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever Gerhard Wiesinger
2015-01-08 17:24 ` Paolo Bonzini
2015-01-08 18:12 ` Gerhard Wiesinger
2015-01-08 18:22 ` Paolo Bonzini
2015-01-08 22:28 ` Gerhard Wiesinger
2015-01-08 22:42 ` Paolo Bonzini
2015-01-12 11:41 ` Gerhard Wiesinger
2015-01-12 11:46 ` Paolo Bonzini
2015-01-12 11:53 ` Gerhard Wiesinger
2015-01-12 11:54 ` Paolo Bonzini
2015-01-13 20:13 ` Gerhard Wiesinger
2015-01-13 20:48 ` Paolo Bonzini
2015-01-13 21:08 ` Cole Robinson
2015-01-13 21:14 ` Gerhard Wiesinger
2015-01-13 21:16 ` Paolo Bonzini
2015-01-13 22:01 ` Gerhard Wiesinger
2015-01-14 0:59 ` Laine Stump
2015-01-14 9:15 ` Gerhard Wiesinger
2015-02-15 8:18 ` Gerhard Wiesinger
2015-02-16 14:18 ` Paolo Bonzini
2015-02-16 15:09 ` Gerhard Wiesinger
2015-02-16 15:29 ` Paolo Bonzini
2015-03-01 10:36 ` Gerhard Wiesinger
2015-03-02 9:26 ` Paolo Bonzini
2015-03-02 15:52 ` Gerhard Wiesinger
2015-03-02 17:15 ` Gerhard Wiesinger
2015-03-03 9:12 ` Gerhard Wiesinger
2015-03-03 10:12 ` Paolo Bonzini
2015-03-03 11:51 ` Gonglei
2015-03-03 12:28 ` Gerhard Wiesinger
2015-03-03 13:18 ` Gerhard Wiesinger
2015-03-03 20:50 ` Gerhard Wiesinger
2015-01-14 17:47 ` Juan Quintela
2015-01-14 17:52 ` Juan Quintela
2015-01-14 22:19 ` Gerhard Wiesinger
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.