All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] coroutine-ucontext broken for x86-32
@ 2012-05-08 19:35 Jan Kiszka
  2012-05-09  7:32 ` Michael Tokarev
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jan Kiszka @ 2012-05-08 19:35 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Michael Tokarev

Hi,

I hunted down a fairly subtle corruption of the VCPU thread signal mask
in KVM mode when using the ucontext version of coroutines:

coroutine_new calls getcontext, makecontext, swapcontext. Those
functions get/set also the signal mask of the caller. Unfortunately,
they only use the sigprocmask syscall on i386, not the rt_sigprocmask
version. So they do not properly save/restore the blocked RT signals,
namely our SIG_IPI - it becomes unblocke this way. And this will sooner
or later make the kernel actually deliver a SIG_IPI to our
dummy_handler, and we miss a wakeup, which means losing control over
VCPU thread - qemu hangs.

I was able to reproduce the issue very reliably with virtio-block
enabled, 32-bit qemu userspace on a 64-bit host, using a 32-bit WinXP
guest.

Simple workaround:

diff --git a/main-loop.h b/main-loop.h
index c06b8bc..dce1cd9 100644
--- a/main-loop.h
+++ b/main-loop.h
@@ -25,11 +25,7 @@
 #ifndef QEMU_MAIN_LOOP_H
 #define QEMU_MAIN_LOOP_H 1
 
-#ifdef SIGRTMIN
-#define SIG_IPI (SIGRTMIN+4)
-#else
 #define SIG_IPI SIGUSR1
-#endif
 
 /**
  * qemu_init_main_loop: Set up the process so that it can run the main loop.


But maybe someone has a better idea, ie. something that addresses the
issue at the root. Otherwise we would have to erect large warning signs:
"Do not use RT signals! Coroutines will break them for you."

Michael, maybe this also relates to the issue you saw. I'm not able to
reproduce any VAPIC problems after make Windows bootable by switching
to SIGUSR1.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-08 19:35 [Qemu-devel] coroutine-ucontext broken for x86-32 Jan Kiszka
@ 2012-05-09  7:32 ` Michael Tokarev
  2012-05-09 11:12   ` Jan Kiszka
  2012-05-09 10:11 ` Kevin Wolf
  2012-05-09 18:05 ` Michael Tokarev
  2 siblings, 1 reply; 13+ messages in thread
From: Michael Tokarev @ 2012-05-09  7:32 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

On 08.05.2012 23:35, Jan Kiszka wrote:
> Hi,
> 
> I hunted down a fairly subtle corruption of the VCPU thread signal mask
> in KVM mode when using the ucontext version of coroutines:
> 
> coroutine_new calls getcontext, makecontext, swapcontext. Those
> functions get/set also the signal mask of the caller. Unfortunately,
> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
> version. So they do not properly save/restore the blocked RT signals,
> namely our SIG_IPI - it becomes unblocke this way. And this will sooner
> or later make the kernel actually deliver a SIG_IPI to our
> dummy_handler, and we miss a wakeup, which means losing control over
> VCPU thread - qemu hangs.
> 
> I was able to reproduce the issue very reliably with virtio-block
> enabled, 32-bit qemu userspace on a 64-bit host, using a 32-bit WinXP
> guest.

Jan, I tried to hunt down (well, FSVO anyway, since I don't understand
qemu code as a whole still) this very issue since some 0.15 (IIRC -
when coroutines were introduced) version.  The sympthom I faced was
32bit kvm process lockup when rebooting windows guest.  The cause
was lost/ignored interrupts, and for me it was possible to just
suspend/resume (SIGSTOP/SIGCONT) the kvm process or to attach a
debugger or strace to it.  It looked like a corruption somewhere,
and while bisecting I were finding "unrelated" commits -- like,
eg, "switch qcow2 to coroutines" (I was using -snapshot, so qcow2
was actually in use, but the commit itself were innocent).  There
are several discussions in archives, debian bugreport about it and
several IRC discussions, all with no outcome.  So at least now I
can say that it is not only me who see the issue, so it passes a
reality check somehow... ;)

But the thing is: generally, almost no one cares about 32/64bit
"mixed" environment anymore.  I had a few users in Debian who
complained, and it has always been the same scenario: an old 32bit
install moved to a new hardware, next due to large amount of
memory, switch to 64bit kernel, and the result is "something
not working".  My suggestion to them has always been "reinstall".
I use such a mixed environment myself on my development box
(and actually even on production machines @office), so I'm
one of the first to face issues in this area, and it sometimes
does not let me to do other things -- eg, I can't debug some
other bug because qemu locks up due to this 32/64 thing.  I
learned to use a 64bit chroot for this things after all.

So I'm not sure if there's enough interest to hunt this.  It
must be something very simple, and it might pop up somewhere
else, but so far it - seemingly - only affects 32/64bit mixed
environment.

> Simple workaround:
> 
> diff --git a/main-loop.h b/main-loop.h
> index c06b8bc..dce1cd9 100644
> --- a/main-loop.h
> +++ b/main-loop.h
> @@ -25,11 +25,7 @@
>  #ifndef QEMU_MAIN_LOOP_H
>  #define QEMU_MAIN_LOOP_H 1
>  
> -#ifdef SIGRTMIN
> -#define SIG_IPI (SIGRTMIN+4)
> -#else
>  #define SIG_IPI SIGUSR1
> -#endif
>  
>  /**
>   * qemu_init_main_loop: Set up the process so that it can run the main loop.
> 
> 
> But maybe someone has a better idea, ie. something that addresses the
> issue at the root. Otherwise we would have to erect large warning signs:
> "Do not use RT signals! Coroutines will break them for you."
> 
> Michael, maybe this also relates to the issue you saw. I'm not able to
> reproduce any VAPIC problems after make Windows bootable by switching
> to SIGUSR1.

I'll try to verify it later today, I've unrelated urgent family
issues right now...

Thank you!

/mjt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-08 19:35 [Qemu-devel] coroutine-ucontext broken for x86-32 Jan Kiszka
  2012-05-09  7:32 ` Michael Tokarev
@ 2012-05-09 10:11 ` Kevin Wolf
  2012-05-09 10:55   ` Jan Kiszka
  2012-05-09 11:15   ` Peter Maydell
  2012-05-09 18:05 ` Michael Tokarev
  2 siblings, 2 replies; 13+ messages in thread
From: Kevin Wolf @ 2012-05-09 10:11 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Anthony Liguori, Michael Tokarev, qemu-devel

Am 08.05.2012 21:35, schrieb Jan Kiszka:
> Hi,
> 
> I hunted down a fairly subtle corruption of the VCPU thread signal mask
> in KVM mode when using the ucontext version of coroutines:
> 
> coroutine_new calls getcontext, makecontext, swapcontext. Those
> functions get/set also the signal mask of the caller. Unfortunately,
> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
> version. So they do not properly save/restore the blocked RT signals,
> namely our SIG_IPI - it becomes unblocke this way.

If other coroutine backends work (sigaltstack?), we could try to detect
the situation in configure and set the right default. Not sure what the
condition is, glibc + i386?

Kevin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 10:11 ` Kevin Wolf
@ 2012-05-09 10:55   ` Jan Kiszka
  2012-05-09 11:15   ` Peter Maydell
  1 sibling, 0 replies; 13+ messages in thread
From: Jan Kiszka @ 2012-05-09 10:55 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Michael Tokarev, qemu-devel

On 2012-05-09 07:11, Kevin Wolf wrote:
> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>> Hi,
>>
>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>> in KVM mode when using the ucontext version of coroutines:
>>
>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>> functions get/set also the signal mask of the caller. Unfortunately,
>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>> version. So they do not properly save/restore the blocked RT signals,
>> namely our SIG_IPI - it becomes unblocke this way.
> 
> If other coroutine backends work (sigaltstack?),

Had a brief look at signalstack. Is kill(getpid(), SIGUSR2) in
coroutine_new correct? This should broadcast to the whole process, but
we only block SIGUSR2 in the caller's context. Why not pthread_kill?

I will give it a try nevertheless.

> we could try to detect
> the situation in configure and set the right default. Not sure what the
> condition is, glibc + i386?

So far it looks like a glibc/i386 thing, but maybe it is driven by the
ABI that defines the size of jmpbuf and, thus, the ability to store RT
signals as well. Then it is a generic i386 limitation.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09  7:32 ` Michael Tokarev
@ 2012-05-09 11:12   ` Jan Kiszka
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kiszka @ 2012-05-09 11:12 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

On 2012-05-09 04:32, Michael Tokarev wrote:
> On 08.05.2012 23:35, Jan Kiszka wrote:
>> Hi,
>>
>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>> in KVM mode when using the ucontext version of coroutines:
>>
>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>> functions get/set also the signal mask of the caller. Unfortunately,
>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>> version. So they do not properly save/restore the blocked RT signals,
>> namely our SIG_IPI - it becomes unblocke this way. And this will sooner
>> or later make the kernel actually deliver a SIG_IPI to our
>> dummy_handler, and we miss a wakeup, which means losing control over
>> VCPU thread - qemu hangs.
>>
>> I was able to reproduce the issue very reliably with virtio-block
>> enabled, 32-bit qemu userspace on a 64-bit host, using a 32-bit WinXP
>> guest.
> 
> Jan, I tried to hunt down (well, FSVO anyway, since I don't understand
> qemu code as a whole still) this very issue since some 0.15 (IIRC -
> when coroutines were introduced) version.  The sympthom I faced was
> 32bit kvm process lockup when rebooting windows guest.  The cause
> was lost/ignored interrupts, and for me it was possible to just
> suspend/resume (SIGSTOP/SIGCONT) the kvm process or to attach a
> debugger or strace to it.  It looked like a corruption somewhere,
> and while bisecting I were finding "unrelated" commits -- like,
> eg, "switch qcow2 to coroutines" (I was using -snapshot, so qcow2
> was actually in use, but the commit itself were innocent).  There
> are several discussions in archives, debian bugreport about it and
> several IRC discussions, all with no outcome.  So at least now I
> can say that it is not only me who see the issue, so it passes a
> reality check somehow... ;)
> 
> But the thing is: generally, almost no one cares about 32/64bit
> "mixed" environment anymore.  I had a few users in Debian who
> complained, and it has always been the same scenario: an old 32bit
> install moved to a new hardware, next due to large amount of
> memory, switch to 64bit kernel, and the result is "something
> not working".  My suggestion to them has always been "reinstall".
> I use such a mixed environment myself on my development box
> (and actually even on production machines @office), so I'm
> one of the first to face issues in this area, and it sometimes
> does not let me to do other things -- eg, I can't debug some
> other bug because qemu locks up due to this 32/64 thing.  I
> learned to use a 64bit chroot for this things after all.
> 
> So I'm not sure if there's enough interest to hunt this.  It
> must be something very simple, and it might pop up somewhere
> else, but so far it - seemingly - only affects 32/64bit mixed
> environment.

This issue also affects 32/32 installations.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 10:11 ` Kevin Wolf
  2012-05-09 10:55   ` Jan Kiszka
@ 2012-05-09 11:15   ` Peter Maydell
  2012-05-09 11:38     ` Jan Kiszka
  2012-05-09 14:37     ` Avi Kivity
  1 sibling, 2 replies; 13+ messages in thread
From: Peter Maydell @ 2012-05-09 11:15 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Jan Kiszka, Anthony Liguori, Michael Tokarev, qemu-devel

On 9 May 2012 11:11, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>> in KVM mode when using the ucontext version of coroutines:
>>
>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>> functions get/set also the signal mask of the caller. Unfortunately,
>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>> version. So they do not properly save/restore the blocked RT signals,
>> namely our SIG_IPI - it becomes unblocke this way.
>
> If other coroutine backends work (sigaltstack?), we could try to detect
> the situation in configure and set the right default. Not sure what the
> condition is, glibc + i386?

I don't think you can do a compile-time test for this short of
just disabling use of the ucontext code on all i386/Linux platforms.

I think it's becoming increasingly obvious that the setcontext/getcontext
code path is not very well used and prone to nasty libc bugs. Trying
to implement coroutines in C is just a really bad idea and I think
we should be trying to reduce our use of them if we possibly can,
presumably by switching to actually using threads where we really
need the parallelism.

-- PMM

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 11:15   ` Peter Maydell
@ 2012-05-09 11:38     ` Jan Kiszka
  2012-05-09 17:17       ` Anthony Liguori
  2012-05-09 14:37     ` Avi Kivity
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kiszka @ 2012-05-09 11:38 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Kevin Wolf, Anthony Liguori, Michael Tokarev, qemu-devel

On 2012-05-09 08:15, Peter Maydell wrote:
> On 9 May 2012 11:11, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>>> in KVM mode when using the ucontext version of coroutines:
>>>
>>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>>> functions get/set also the signal mask of the caller. Unfortunately,
>>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>>> version. So they do not properly save/restore the blocked RT signals,
>>> namely our SIG_IPI - it becomes unblocke this way.
>>
>> If other coroutine backends work (sigaltstack?), we could try to detect
>> the situation in configure and set the right default. Not sure what the
>> condition is, glibc + i386?
> 
> I don't think you can do a compile-time test for this short of
> just disabling use of the ucontext code on all i386/Linux platforms.
> 
> I think it's becoming increasingly obvious that the setcontext/getcontext
> code path is not very well used and prone to nasty libc bugs. Trying
> to implement coroutines in C is just a really bad idea and I think
> we should be trying to reduce our use of them if we possibly can,
> presumably by switching to actually using threads where we really
> need the parallelism.

I tend to agree.

FWIW, sigaltstack works around the issue here, but I'm still looking s
bit skeptical at its implementation.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 11:15   ` Peter Maydell
  2012-05-09 11:38     ` Jan Kiszka
@ 2012-05-09 14:37     ` Avi Kivity
  1 sibling, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2012-05-09 14:37 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Kevin Wolf, Jan Kiszka, Anthony Liguori, Michael Tokarev, qemu-devel

On 05/09/2012 02:15 PM, Peter Maydell wrote:
> I think it's becoming increasingly obvious that the setcontext/getcontext
> code path is not very well used and prone to nasty libc bugs. Trying
> to implement coroutines in C is just a really bad idea and I think
> we should be trying to reduce our use of them if we possibly can,
> presumably by switching to actually using threads where we really
> need the parallelism.
>

I happen to like coroutines, but if we switch to threads, we should
ensure that we eliminate the extra context switch if the layer below
(usually posix-aio-compat) also uses threads.  That leaves a theoretical
regression with qcow2-on-native-aio, but that's a fairly rare use case.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 11:38     ` Jan Kiszka
@ 2012-05-09 17:17       ` Anthony Liguori
  2012-05-09 17:25         ` Jan Kiszka
  0 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2012-05-09 17:17 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Kevin Wolf, Peter Maydell, Anthony Liguori, Michael Tokarev, qemu-devel

On 05/09/2012 06:38 AM, Jan Kiszka wrote:
> On 2012-05-09 08:15, Peter Maydell wrote:
>> On 9 May 2012 11:11, Kevin Wolf<kwolf@redhat.com>  wrote:
>>> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>>>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>>>> in KVM mode when using the ucontext version of coroutines:
>>>>
>>>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>>>> functions get/set also the signal mask of the caller. Unfortunately,
>>>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>>>> version. So they do not properly save/restore the blocked RT signals,
>>>> namely our SIG_IPI - it becomes unblocke this way.
>>>
>>> If other coroutine backends work (sigaltstack?), we could try to detect
>>> the situation in configure and set the right default. Not sure what the
>>> condition is, glibc + i386?
>>
>> I don't think you can do a compile-time test for this short of
>> just disabling use of the ucontext code on all i386/Linux platforms.
>>
>> I think it's becoming increasingly obvious that the setcontext/getcontext
>> code path is not very well used and prone to nasty libc bugs. Trying
>> to implement coroutines in C is just a really bad idea and I think
>> we should be trying to reduce our use of them if we possibly can,
>> presumably by switching to actually using threads where we really
>> need the parallelism.
>
> I tend to agree.
>
> FWIW, sigaltstack works around the issue here, but I'm still looking s
> bit skeptical at its implementation.

Is there any downside to using SIGUSR1?

Regards,

Anthony Liguori

>
> Jan
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 17:17       ` Anthony Liguori
@ 2012-05-09 17:25         ` Jan Kiszka
  2012-05-09 17:31           ` Anthony Liguori
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kiszka @ 2012-05-09 17:25 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Peter Maydell, Anthony Liguori, Michael Tokarev, qemu-devel

On 2012-05-09 14:17, Anthony Liguori wrote:
> On 05/09/2012 06:38 AM, Jan Kiszka wrote:
>> On 2012-05-09 08:15, Peter Maydell wrote:
>>> On 9 May 2012 11:11, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>>>>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>>>>> in KVM mode when using the ucontext version of coroutines:
>>>>>
>>>>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>>>>> functions get/set also the signal mask of the caller. Unfortunately,
>>>>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>>>>> version. So they do not properly save/restore the blocked RT signals,
>>>>> namely our SIG_IPI - it becomes unblocke this way.
>>>>
>>>> If other coroutine backends work (sigaltstack?), we could try to detect
>>>> the situation in configure and set the right default. Not sure what the
>>>> condition is, glibc + i386?
>>>
>>> I don't think you can do a compile-time test for this short of
>>> just disabling use of the ucontext code on all i386/Linux platforms.
>>>
>>> I think it's becoming increasingly obvious that the setcontext/getcontext
>>> code path is not very well used and prone to nasty libc bugs. Trying
>>> to implement coroutines in C is just a really bad idea and I think
>>> we should be trying to reduce our use of them if we possibly can,
>>> presumably by switching to actually using threads where we really
>>> need the parallelism.
>>
>> I tend to agree.
>>
>> FWIW, sigaltstack works around the issue here, but I'm still looking s
>> bit skeptical at its implementation.
> 
> Is there any downside to using SIGUSR1?

You mean for SIG_IPI? I don't think so. But the point is that the, well,
limitation of ucontext will continue to break RT signals, and this in a
very nasty way as only a specific setup is affected. I can't imagine we
want this.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 17:25         ` Jan Kiszka
@ 2012-05-09 17:31           ` Anthony Liguori
  2012-05-09 18:21             ` Jan Kiszka
  0 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2012-05-09 17:31 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Kevin Wolf, Peter Maydell, Michael Tokarev, qemu-devel, Anthony Liguori

On 05/09/2012 12:25 PM, Jan Kiszka wrote:
> On 2012-05-09 14:17, Anthony Liguori wrote:
>> On 05/09/2012 06:38 AM, Jan Kiszka wrote:
>>> On 2012-05-09 08:15, Peter Maydell wrote:
>>>> On 9 May 2012 11:11, Kevin Wolf<kwolf@redhat.com>   wrote:
>>>>> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>>>>>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>>>>>> in KVM mode when using the ucontext version of coroutines:
>>>>>>
>>>>>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>>>>>> functions get/set also the signal mask of the caller. Unfortunately,
>>>>>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>>>>>> version. So they do not properly save/restore the blocked RT signals,
>>>>>> namely our SIG_IPI - it becomes unblocke this way.
>>>>>
>>>>> If other coroutine backends work (sigaltstack?), we could try to detect
>>>>> the situation in configure and set the right default. Not sure what the
>>>>> condition is, glibc + i386?
>>>>
>>>> I don't think you can do a compile-time test for this short of
>>>> just disabling use of the ucontext code on all i386/Linux platforms.
>>>>
>>>> I think it's becoming increasingly obvious that the setcontext/getcontext
>>>> code path is not very well used and prone to nasty libc bugs. Trying
>>>> to implement coroutines in C is just a really bad idea and I think
>>>> we should be trying to reduce our use of them if we possibly can,
>>>> presumably by switching to actually using threads where we really
>>>> need the parallelism.
>>>
>>> I tend to agree.
>>>
>>> FWIW, sigaltstack works around the issue here, but I'm still looking s
>>> bit skeptical at its implementation.
>>
>> Is there any downside to using SIGUSR1?
>
> You mean for SIG_IPI? I don't think so. But the point is that the, well,
> limitation of ucontext will continue to break RT signals,

Yes, but we currently don't use RT signals, right?  So we could switch to 
SIGUSR1, fix the problem in glibc, and call it a day, no?

Regards,

Anthony Liguori

  and this in a
> very nasty way as only a specific setup is affected. I can't imagine we
> want this.
>
> Jan
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-08 19:35 [Qemu-devel] coroutine-ucontext broken for x86-32 Jan Kiszka
  2012-05-09  7:32 ` Michael Tokarev
  2012-05-09 10:11 ` Kevin Wolf
@ 2012-05-09 18:05 ` Michael Tokarev
  2 siblings, 0 replies; 13+ messages in thread
From: Michael Tokarev @ 2012-05-09 18:05 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

On 08.05.2012 23:35, Jan Kiszka wrote:
> Hi,
> 
> I hunted down a fairly subtle corruption of the VCPU thread signal mask
> in KVM mode when using the ucontext version of coroutines:
> 
> coroutine_new calls getcontext, makecontext, swapcontext. Those
> functions get/set also the signal mask of the caller. Unfortunately,
> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
> version. So they do not properly save/restore the blocked RT signals,
> namely our SIG_IPI - it becomes unblocke this way. And this will sooner
> or later make the kernel actually deliver a SIG_IPI to our
> dummy_handler, and we miss a wakeup, which means losing control over
> VCPU thread - qemu hangs.
> 
> I was able to reproduce the issue very reliably with virtio-block
> enabled, 32-bit qemu userspace on a 64-bit host, using a 32-bit WinXP
> guest.
> 
> Simple workaround:
> 
> diff --git a/main-loop.h b/main-loop.h
> index c06b8bc..dce1cd9 100644
> --- a/main-loop.h
> +++ b/main-loop.h
> @@ -25,11 +25,7 @@
>  #ifndef QEMU_MAIN_LOOP_H
>  #define QEMU_MAIN_LOOP_H 1
>  
> -#ifdef SIGRTMIN
> -#define SIG_IPI (SIGRTMIN+4)
> -#else
>  #define SIG_IPI SIGUSR1
> -#endif
[]
> Michael, maybe this also relates to the issue you saw. I'm not able to
> reproduce any VAPIC problems after make Windows bootable by switching
> to SIGUSR1.

FWIW, this fixes both the STOP 0x5c during reboot of windows 32bit
guest and my other issue, with qemu stalling on 32/64bit environment.

So yes indeed, that's the same thing apparently...

Nice catch!

/mjt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] coroutine-ucontext broken for x86-32
  2012-05-09 17:31           ` Anthony Liguori
@ 2012-05-09 18:21             ` Jan Kiszka
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kiszka @ 2012-05-09 18:21 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Peter Maydell, Michael Tokarev, qemu-devel, Anthony Liguori

On 2012-05-09 14:31, Anthony Liguori wrote:
> On 05/09/2012 12:25 PM, Jan Kiszka wrote:
>> On 2012-05-09 14:17, Anthony Liguori wrote:
>>> On 05/09/2012 06:38 AM, Jan Kiszka wrote:
>>>> On 2012-05-09 08:15, Peter Maydell wrote:
>>>>> On 9 May 2012 11:11, Kevin Wolf<kwolf@redhat.com>   wrote:
>>>>>> Am 08.05.2012 21:35, schrieb Jan Kiszka:
>>>>>>> I hunted down a fairly subtle corruption of the VCPU thread signal mask
>>>>>>> in KVM mode when using the ucontext version of coroutines:
>>>>>>>
>>>>>>> coroutine_new calls getcontext, makecontext, swapcontext. Those
>>>>>>> functions get/set also the signal mask of the caller. Unfortunately,
>>>>>>> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
>>>>>>> version. So they do not properly save/restore the blocked RT signals,
>>>>>>> namely our SIG_IPI - it becomes unblocke this way.
>>>>>>
>>>>>> If other coroutine backends work (sigaltstack?), we could try to detect
>>>>>> the situation in configure and set the right default. Not sure what the
>>>>>> condition is, glibc + i386?
>>>>>
>>>>> I don't think you can do a compile-time test for this short of
>>>>> just disabling use of the ucontext code on all i386/Linux platforms.
>>>>>
>>>>> I think it's becoming increasingly obvious that the setcontext/getcontext
>>>>> code path is not very well used and prone to nasty libc bugs. Trying
>>>>> to implement coroutines in C is just a really bad idea and I think
>>>>> we should be trying to reduce our use of them if we possibly can,
>>>>> presumably by switching to actually using threads where we really
>>>>> need the parallelism.
>>>>
>>>> I tend to agree.
>>>>
>>>> FWIW, sigaltstack works around the issue here, but I'm still looking s
>>>> bit skeptical at its implementation.
>>>
>>> Is there any downside to using SIGUSR1?
>>
>> You mean for SIG_IPI? I don't think so. But the point is that the, well,
>> limitation of ucontext will continue to break RT signals,
> 
> Yes, but we currently don't use RT signals, right?  So we could switch to 
> SIGUSR1, fix the problem in glibc, and call it a day, no?

That cures the current symptom but does not prevent future diseases
around RT signals. I would prefer to disable ucontext usage on those
platforms we identified as broken.

BTW, I'm starting to believe it's not a glibc but rather a Linux kernel
issue, only biting us on 32/64. sigprocmask should only manipulate those
signals, its masks can address. Digging deeper...

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-05-09 18:21 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-08 19:35 [Qemu-devel] coroutine-ucontext broken for x86-32 Jan Kiszka
2012-05-09  7:32 ` Michael Tokarev
2012-05-09 11:12   ` Jan Kiszka
2012-05-09 10:11 ` Kevin Wolf
2012-05-09 10:55   ` Jan Kiszka
2012-05-09 11:15   ` Peter Maydell
2012-05-09 11:38     ` Jan Kiszka
2012-05-09 17:17       ` Anthony Liguori
2012-05-09 17:25         ` Jan Kiszka
2012-05-09 17:31           ` Anthony Liguori
2012-05-09 18:21             ` Jan Kiszka
2012-05-09 14:37     ` Avi Kivity
2012-05-09 18:05 ` Michael Tokarev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.