* Race condition with scheduler runqueues
@ 2013-02-18 18:11 Andrew Cooper
  2013-02-19  9:28 ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2013-02-18 18:11 UTC (permalink / raw)
  To: Xen-devel List, George Dunlap, Keir Fraser, Jan Beulich

Hello,

Our testing has discovered a crash (pagefault at 0x0000000000000008)
which I have tracked down to a bad __runq_remove() in csched_vcpu_sleep()
in sched_credit.c (a static function of the same name also exists in
sched_credit2.c, which confused matters to start with).

The test case was a loop of localhost migrations of a 1-vcpu HVM Win8
domain.  The test case itself has passed many times in the past on the
same Xen codebase (Xen-4.1.3), indicating that the race is very rare.
There do not appear to be any relevant changes between the version of
Xen under test and xen-4.1-testing.

The failure itself is caused by a XEN_DOMCTL_scheduler_op (trace below)
from dom0, targeting the VCPU of the migrating domain.

(XEN) Xen call trace:
(XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
(XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
(XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
(XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
(XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
(XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80

The relevant part of csched_vcpu_sleep() is

    else if ( __vcpu_on_runq(svc) )
        __runq_remove(svc);

which disassembles to

ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
ffff82c480116a07:       75 07                   jne    ffff82c480116a10
<csched_vcpu_sleep+0x40>
ffff82c480116a09:       f3 c3                   repz retq
ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
<- Pagefault here
ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)

The relevant crash registers from the pagefault are:
rax: 0000000000000000
rdx: 0000000000000000
 r8: ffff83080c89ed90

If I am reading the code correctly, this means that runq->next is NULL,
so we fail the list_empty() check and erroneously pass __vcpu_on_runq().
We then take a pagefault on the write through that NULL next pointer
(runq->prev is also NULL, per the register dump).
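
For reference, here is a minimal, self-contained sketch of the list
primitives involved (simplified from the Linux-style list.h helpers Xen
uses and from sched_credit.c, not the verbatim source), showing why an
all-zero runq_elem defeats the emptiness check and faults at offset 0x8:

    struct list_head { struct list_head *next, *prev; };

    static inline int list_empty(const struct list_head *head)
    {
        return head->next == head;     /* NULL != head, so "not empty" */
    }

    static inline void list_del_init(struct list_head *entry)
    {
        struct list_head *next = entry->next, *prev = entry->prev;
        next->prev = prev;             /* next == NULL: the write at 0x8 faults */
        prev->next = next;
        entry->next = entry;           /* re-initialise the entry, matching */
        entry->prev = entry;           /* the tail of the disassembly above */
    }

    /* __vcpu_on_runq(svc) is essentially !list_empty(&svc->runq_elem), and
     * __runq_remove(svc) essentially does list_del_init(&svc->runq_elem).
     * With runq_elem.next == runq_elem.prev == NULL, __vcpu_on_runq()
     * wrongly reports the vcpu as queued, and list_del_init() then writes
     * through the NULL next pointer at offset 8, i.e. the pagefault at
     * 0x0000000000000008. */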

The only place I can spot in the code where runq->{next,prev} could
conceivably be NULL is in csched_alloc_vdata(), between the memset() and
the INIT_LIST_HEAD().  This is logically sensible in combination with the
localhost migrate loop, and I can't immediately see anything to prevent
this race from happening.

Can anyone with more knowledge of this area of the code comment on the
plausibility of my analysis?  Unfortunately, I can't see an easy
solution.

~Andrew

* Re: Race condition with scheduler runqueues
  2013-02-18 18:11 Race condition with scheduler runqueues Andrew Cooper
@ 2013-02-19  9:28 ` Jan Beulich
  2013-02-19 11:47   ` Andrew Cooper
  2013-02-27  6:19   ` Juergen Gross
  0 siblings, 2 replies; 8+ messages in thread
From: Jan Beulich @ 2013-02-19  9:28 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Juergen Gross, Keir Fraser, Xen-devel List

>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Hello,
> 
> Our testing has discovered a crash (pagefault at 0x0000000000000008)
> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
> in sched_credit.c (because a static function of the same name also
> exists in sched_credit2.c, which confused matters to start with)
> 
> The test case was a loop of localhost migrate of a 1vcpu HVM win8
> domain.  The test case itself has passed many times in the past on the
> same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
> does not appear to be any relevant changes between the version of Xen in
> the test and xen-4.1-testing.
> 
> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
> from dom0, targeting the VCPU of the migrating domain.
> 
> (XEN) Xen call trace:
> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
> 
> The relevant part of csched_vcpu_sleep() is
> 
>     else if ( __vcpu_on_runq(svc) )
>         __runq_remove(svc);
> 
> which disassembles to
> 
> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
> ffff82c480116a07:       75 07                   jne    ffff82c480116a10
> <csched_vcpu_sleep+0x40>
> ffff82c480116a09:       f3 c3                   repz retq
> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
> <- Pagefault here
> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
> 
> The relevant crash registers from the pagefault are:
> rax: 0000000000000000
> rdx: 0000000000000000
>  r8: ffff83080c89ed90
> 
> If I am reading the code correctly, this means that runq->next is NULL,
> so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
> fail with a fault when trying to update runq->prev, which is also NULL.
> 
> The only place I can spot in the code where the runq->{next,prev} could
> conceivably be NULL is in csched_alloc_vdata() between the memset() and
> INIT_LIST_HEAD().  This is logically sensible in combination with the
> localhost migrate loop, and I cant immediately see anything to prevent
> this race happening.

But that doesn't make sense: csched_alloc_vdata() doesn't store
svc into vc->sched_priv; that's being done by the generic
scheduler code once the actor returns.

So I'd rather suspect a stale pointer being used, which is easily
possible when racing with sched_move_domain() (as opposed to
schedule_cpu_switch(), where the new pointer gets stored
_before_ de-allocating the old one).
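
To illustrate the ordering difference being described, here is a
hypothetical, simplified stand-in (not the actual Xen code), with free()
standing in for the free_vdata hook:

    #include <stdlib.h>

    struct vcpu_like { void *sched_priv; };

    /* sched_move_domain()-style ordering: the old per-vcpu data is freed
     * (and the pointer cleared) before the replacement is installed,
     * leaving a window in which a racing reader - e.g. csched_vcpu_sleep()
     * reached via a domctl - follows a freed or NULL pointer. */
    static void move_domain_style(struct vcpu_like *v, void *new_priv)
    {
        free(v->sched_priv);
        v->sched_priv = NULL;
        /* <-- race window: concurrent readers see NULL / freed memory */
        v->sched_priv = new_priv;
    }

    /* schedule_cpu_switch()-style ordering: publish the new pointer first,
     * then free the old allocation, so a racing reader always sees one of
     * the two valid objects. */
    static void cpu_switch_style(struct vcpu_like *v, void *new_priv)
    {
        void *old_priv = v->sched_priv;
        v->sched_priv = new_priv;
        free(old_priv);
    }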

However, sched_move_domain() (as well as schedule_cpu_switch())
gets called only from the CPU pools code, and I would guess CPU pools
aren't involved here, and that you don't soft offline/online pCPU-s
in parallel (as I'm sure you would otherwise have mentioned it).

But wait - libxl__domain_make() _unconditionally_ calls
xc_cpupool_movedomain(), as does XendDomainInfo's
_constructDomain().  The reason for this escapes me - Jürgen?  Yet
I'd expect the pool-ID matching check to short-circuit the resulting
sysctl, i.e. sched_move_domain() ought not to be reached anyway
(worth verifying, of course).

The race there nevertheless ought to be fixed.

Jan


* Re: Race condition with scheduler runqueues
  2013-02-19  9:28 ` Jan Beulich
@ 2013-02-19 11:47   ` Andrew Cooper
  2013-02-26 12:08     ` George Dunlap
  2013-02-27  6:19   ` Juergen Gross
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2013-02-19 11:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Juergen Gross, Keir (Xen.org), Xen-devel List

On 19/02/13 09:28, Jan Beulich wrote:
>>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Hello,
>>
>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>> in sched_credit.c (because a static function of the same name also
>> exists in sched_credit2.c, which confused matters to start with)
>>
>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>> domain.  The test case itself has passed many times in the past on the
>> same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
>> does not appear to be any relevant changes between the version of Xen in
>> the test and xen-4.1-testing.
>>
>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>> from dom0, targeting the VCPU of the migrating domain.
>>
>> (XEN) Xen call trace:
>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>
>> The relevant part of csched_vcpu_sleep() is
>>
>>     else if ( __vcpu_on_runq(svc) )
>>         __runq_remove(svc);
>>
>> which disassembles to
>>
>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10
>> <csched_vcpu_sleep+0x40>
>> ffff82c480116a09:       f3 c3                   repz retq
>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
>> <- Pagefault here
>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>
>> The relevant crash registers from the pagefault are:
>> rax: 0000000000000000
>> rdx: 0000000000000000
>>  r8: ffff83080c89ed90
>>
>> If I am reading the code correctly, this means that runq->next is NULL,
>> so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
>> fail with a fault when trying to update runq->prev, which is also NULL.
>>
>> The only place I can spot in the code where the runq->{next,prev} could
>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>> localhost migrate loop, and I cant immediately see anything to prevent
>> this race happening.
> But that doesn't make sense: csched_alloc_vdata() doesn't store
> svc into vc->sched_priv; that's being done by the generic
> scheduler code once the actor returns.

D'oh yes - I overlooked that.

>
> So I'd rather suspect a stale pointer being used, which is easily
> possible when racing with sched_move_domain() (as opposed to
> schedule_cpu_switch(), where the new pointer gets stored
> _before_ de-allocating the old one).

>
> However, sched_move_domain() (as well as schedule_cpu_switch())
> get called only from CPU pools code, and I would guess CPU pools
> aren't involved here, and you don't in parallel soft offline/online
> pCPU-s (as I'm sure you otherwise would have mentioned it).
>
> But wait - libxl__domain_make() _unconditionally_ calls
> xc_cpupool_movedomain(), as does XendDomainInfo's
> _constructDomain(). The reason for this escapes me - Jürgen? Yet
> I'd expect the pool ID matching check to short cut the resulting
> sysctl, i.e. sched_move_domain() ought to not be reached anyway
> (worth verifying of course).
>
> The race there nevertheless ought to be fixed.
>
> Jan

Our toolstack hooks directly into libxc and is completely ignorant of
cpupools.  Looking at the crash more closely, it might be a race elsewhere.

Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
associated Xen stack trace is:

[ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
 ffff82c480172d12  __smp_call_function_interrupt+0x62/0xb0
 ffff82c48017339e  smp_call_function_interrupt+0x4e/0x90
 ffff82c48014a65a  call_function_interrupt+0x2a/0x30
 ffff82c4801223b2  _spin_lock+0x12/0x20
 ffff82c4801734ab  flush_area_mask+0xcb/0x1c0
 ffff82c4801c862a  paging_log_dirty_range+0x3a/0x780
 ffff82c480121ea8  cpumask_raise_softirq+0x78/0x80
 ffff82c480117ed3  csched_vcpu_wake+0x193/0x420
 ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
 ffff82c4801f21c7  hap_track_dirty_vram+0x137/0x1c0
 ffff82c4801ad3fd  do_hvm_op+0x16dd/0x1f50
 ffff82c480106251  evtchn_send+0xa1/0x160
 ffff82c480106d36  do_event_channel_op+0x876/0xfd0
 ffff82c4801f9027  compat_update_descriptor+0x27/0x30
 ffff82c4801354f8  compat_multicall+0x198/0x380
 ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
 ffff82c4802013c4  compat_hypercall+0x74/0x80

hap_track_dirty_vram() and paging_log_dirty_range() are part of the
same logical call trace, but it appears that we took an
event_check_interrupt() in the middle and called schedule() off the
bottom of it, ending up in csched_vcpu_wake().

I am currently trying to work out whether a race between
csched_vcpu_{sleep,wake}() could result in the observed crash, but it
certainly looks like a smoking gun.

~Andrew


* Re: Race condition with scheduler runqueues
  2013-02-19 11:47   ` Andrew Cooper
@ 2013-02-26 12:08     ` George Dunlap
  2013-02-26 13:44       ` Andrew Cooper
  0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2013-02-26 12:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Juergen Gross, Keir (Xen.org), Jan Beulich, Xen-devel List

On 02/19/2013 11:47 AM, Andrew Cooper wrote:
> On 19/02/13 09:28, Jan Beulich wrote:
>>>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> Hello,
>>>
>>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>>> in sched_credit.c (because a static function of the same name also
>>> exists in sched_credit2.c, which confused matters to start with)
>>>
>>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>>> domain.  The test case itself has passed many times in the past on the
>>> same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
>>> does not appear to be any relevant changes between the version of Xen in
>>> the test and xen-4.1-testing.
>>>
>>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>>> from dom0, targeting the VCPU of the migrating domain.
>>>
>>> (XEN) Xen call trace:
>>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>>
>>> The relevant part of csched_vcpu_sleep() is
>>>
>>>      else if ( __vcpu_on_runq(svc) )
>>>          __runq_remove(svc);
>>>
>>> which disassembles to
>>>
>>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10
>>> <csched_vcpu_sleep+0x40>
>>> ffff82c480116a09:       f3 c3                   repz retq
>>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
>>> <- Pagefault here
>>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>>
>>> The relevant crash registers from the pagefault are:
>>> rax: 0000000000000000
>>> rdx: 0000000000000000
>>>   r8: ffff83080c89ed90
>>>
>>> If I am reading the code correctly, this means that runq->next is NULL,
>>> so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
>>> fail with a fault when trying to update runq->prev, which is also NULL.
>>>
>>> The only place I can spot in the code where the runq->{next,prev} could
>>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>>> localhost migrate loop, and I cant immediately see anything to prevent
>>> this race happening.
>> But that doesn't make sense: csched_alloc_vdata() doesn't store
>> svc into vc->sched_priv; that's being done by the generic
>> scheduler code once the actor returns.
>
> D'oh yes - I overlooked that.
>
>>
>> So I'd rather suspect a stale pointer being used, which is easily
>> possible when racing with sched_move_domain() (as opposed to
>> schedule_cpu_switch(), where the new pointer gets stored
>> _before_ de-allocating the old one).
>
>>
>> However, sched_move_domain() (as well as schedule_cpu_switch())
>> get called only from CPU pools code, and I would guess CPU pools
>> aren't involved here, and you don't in parallel soft offline/online
>> pCPU-s (as I'm sure you otherwise would have mentioned it).
>>
>> But wait - libxl__domain_make() _unconditionally_ calls
>> xc_cpupool_movedomain(), as does XendDomainInfo's
>> _constructDomain(). The reason for this escapes me - Jürgen? Yet
>> I'd expect the pool ID matching check to short cut the resulting
>> sysctl, i.e. sched_move_domain() ought to not be reached anyway
>> (worth verifying of course).
>>
>> The race there nevertheless ought to be fixed.
>>
>> Jan
>
> Our toolstack hooks directly into libxc and is completely ignorant of
> cpupools.  Looking at the crash more closely, it might be a race elsewhere
>
> Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
> associated Xen stack trace is
>
> [ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
>   ffff82c480172d12  __smp_call_function_interrupt+0x62/0xb0
>   ffff82c48017339e  smp_call_function_interrupt+0x4e/0x90
>   ffff82c48014a65a  call_function_interrupt+0x2a/0x30
>   ffff82c4801223b2  _spin_lock+0x12/0x20
>   ffff82c4801734ab  flush_area_mask+0xcb/0x1c0
>   ffff82c4801c862a  paging_log_dirty_range+0x3a/0x780
>   ffff82c480121ea8  cpumask_raise_softirq+0x78/0x80
>   ffff82c480117ed3  csched_vcpu_wake+0x193/0x420
>   ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
>   ffff82c4801f21c7  hap_track_dirty_vram+0x137/0x1c0
>   ffff82c4801ad3fd  do_hvm_op+0x16dd/0x1f50
>   ffff82c480106251  evtchn_send+0xa1/0x160
>   ffff82c480106d36  do_event_channel_op+0x876/0xfd0
>   ffff82c4801f9027  compat_update_descriptor+0x27/0x30
>   ffff82c4801354f8  compat_multicall+0x198/0x380
>   ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
>   ffff82c4802013c4  compat_hypercall+0x74/0x80
>
> the hap_track_dirty_vram() and paging_log_dirty_range() are part of the
> same logical call trace, but it appears that we have taken an
> event_check_interrupt() in the middle and called schedule() off the
> bottom of it, calling csched_vcpu_wake().
>
> I am currently trying to reason as to whether it is possible for a race
> between csched_vcpu_{sleep,wake}() could result in the seen crash, but
> it certainly looks like a smoking gun.

Any more progress on this one?

In theory all of those should be made mutually exclusive by holding the 
lock of the runqueue on which the VCPU is running.

Any chance there's a race with the assignment of the vcpu -- that is, a 
race in vcpu_schedule_lock() such that someone ends up grabbing the 
wrong lock?
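
For reference, the pattern in question is lock-then-recheck: take the
lock for the runqueue that v->processor currently names, then confirm
the vcpu has not moved in the meantime.  A stand-alone sketch of that
pattern, using hypothetical names and pthread locks rather than the Xen
implementation:

    #include <pthread.h>

    #define NR_CPUS 64

    struct runqueue  { pthread_mutex_t lock; };  /* assume the locks are   */
    struct vcpu_like { volatile unsigned int processor; }; /* initialised
                                                     elsewhere at start of day */

    static struct runqueue runq_of[NR_CPUS];

    /* Lock the runqueue the vcpu is currently assigned to.  The re-check
     * after acquiring the lock is what stops a caller from serialising on
     * the wrong runqueue when the vcpu migrates concurrently; a hole in
     * (or around) this re-check is the kind of race being asked about. */
    static struct runqueue *vcpu_runq_lock(struct vcpu_like *v)
    {
        for ( ; ; )
        {
            unsigned int cpu = v->processor;
            pthread_mutex_lock(&runq_of[cpu].lock);
            if ( cpu == v->processor )
                return &runq_of[cpu];
            pthread_mutex_unlock(&runq_of[cpu].lock);   /* vcpu moved: retry */
        }
    }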

I think that in theory once you call INIT_LIST_HEAD, none of those 
pointers should ever be set to zero; if one ever were it might get 
passed around a bit before actually being followed.  Any chance there's 
something uninitialized somewhere?

And of course, if all else fails, there's good old-fashioned memory 
clobbering as a possibility...

  -George


* Re: Race condition with scheduler runqueues
  2013-02-26 12:08     ` George Dunlap
@ 2013-02-26 13:44       ` Andrew Cooper
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Cooper @ 2013-02-26 13:44 UTC (permalink / raw)
  To: George Dunlap; +Cc: Juergen Gross, Keir (Xen.org), Jan Beulich, Xen-devel List

On 26/02/13 12:08, George Dunlap wrote:
> On 02/19/2013 11:47 AM, Andrew Cooper wrote:
>> On 19/02/13 09:28, Jan Beulich wrote:
>>>>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> Hello,
>>>>
>>>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>>>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>>>> in sched_credit.c (because a static function of the same name also
>>>> exists in sched_credit2.c, which confused matters to start with)
>>>>
>>>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>>>> domain.  The test case itself has passed many times in the past on the
>>>> same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
>>>> does not appear to be any relevant changes between the version of Xen in
>>>> the test and xen-4.1-testing.
>>>>
>>>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>>>> from dom0, targeting the VCPU of the migrating domain.
>>>>
>>>> (XEN) Xen call trace:
>>>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>>>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>>>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>>>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>>>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>>>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>>>
>>>> The relevant part of csched_vcpu_sleep() is
>>>>
>>>>      else if ( __vcpu_on_runq(svc) )
>>>>          __runq_remove(svc);
>>>>
>>>> which disassembles to
>>>>
>>>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>>>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>>>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10
>>>> <csched_vcpu_sleep+0x40>
>>>> ffff82c480116a09:       f3 c3                   repz retq
>>>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>>>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
>>>> <- Pagefault here
>>>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>>>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>>>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>>>
>>>> The relevant crash registers from the pagefault are:
>>>> rax: 0000000000000000
>>>> rdx: 0000000000000000
>>>>   r8: ffff83080c89ed90
>>>>
>>>> If I am reading the code correctly, this means that runq->next is NULL,
>>>> so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
>>>> fail with a fault when trying to update runq->prev, which is also NULL.
>>>>
>>>> The only place I can spot in the code where the runq->{next,prev} could
>>>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>>>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>>>> localhost migrate loop, and I cant immediately see anything to prevent
>>>> this race happening.
>>> But that doesn't make sense: csched_alloc_vdata() doesn't store
>>> svc into vc->sched_priv; that's being done by the generic
>>> scheduler code once the actor returns.
>>
>> D'oh yes - I overlooked that.
>>
>>>
>>> So I'd rather suspect a stale pointer being used, which is easily
>>> possible when racing with sched_move_domain() (as opposed to
>>> schedule_cpu_switch(), where the new pointer gets stored
>>> _before_ de-allocating the old one).
>>
>>>
>>> However, sched_move_domain() (as well as schedule_cpu_switch())
>>> get called only from CPU pools code, and I would guess CPU pools
>>> aren't involved here, and you don't in parallel soft offline/online
>>> pCPU-s (as I'm sure you otherwise would have mentioned it).
>>>
>>> But wait - libxl__domain_make() _unconditionally_ calls
>>> xc_cpupool_movedomain(), as does XendDomainInfo's
>>> _constructDomain(). The reason for this escapes me - Jürgen? Yet
>>> I'd expect the pool ID matching check to short cut the resulting
>>> sysctl, i.e. sched_move_domain() ought to not be reached anyway
>>> (worth verifying of course).
>>>
>>> The race there nevertheless ought to be fixed.
>>>
>>> Jan
>>
>> Our toolstack hooks directly into libxc and is completely ignorant of
>> cpupools.  Looking at the crash more closely, it might be a race elsewhere
>>
>> Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
>> associated Xen stack trace is
>>
>> [ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
>>   ffff82c480172d12  __smp_call_function_interrupt+0x62/0xb0
>>   ffff82c48017339e  smp_call_function_interrupt+0x4e/0x90
>>   ffff82c48014a65a  call_function_interrupt+0x2a/0x30
>>   ffff82c4801223b2  _spin_lock+0x12/0x20
>>   ffff82c4801734ab  flush_area_mask+0xcb/0x1c0
>>   ffff82c4801c862a  paging_log_dirty_range+0x3a/0x780
>>   ffff82c480121ea8  cpumask_raise_softirq+0x78/0x80
>>   ffff82c480117ed3  csched_vcpu_wake+0x193/0x420
>>   ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
>>   ffff82c4801f21c7  hap_track_dirty_vram+0x137/0x1c0
>>   ffff82c4801ad3fd  do_hvm_op+0x16dd/0x1f50
>>   ffff82c480106251  evtchn_send+0xa1/0x160
>>   ffff82c480106d36  do_event_channel_op+0x876/0xfd0
>>   ffff82c4801f9027  compat_update_descriptor+0x27/0x30
>>   ffff82c4801354f8  compat_multicall+0x198/0x380
>>   ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
>>   ffff82c4802013c4  compat_hypercall+0x74/0x80
>>
>> the hap_track_dirty_vram() and paging_log_dirty_range() are part of the
>> same logical call trace, but it appears that we have taken an
>> event_check_interrupt() in the middle and called schedule() off the
>> bottom of it, calling csched_vcpu_wake().
>>
>> I am currently trying to reason as to whether it is possible for a race
>> between csched_vcpu_{sleep,wake}() could result in the seen crash, but
>> it certainly looks like a smoking gun.
>
> Any more progress on this one?
>
> In theory all of those should be made mutually exclusive by holding the 
> lock of the runqueue on which the VCPU is running.
>
> Any chance there's a race with the assignment of the vcpu -- that is, a 
> race in vcpu_schedule_lock() such that someone ends up grabbing the 
> wrong lock?
>
> I think that in theory once you call INIT_LIST_HEAD, none of those 
> pointers should ever be set to zero; if one ever were it might get 
> passed around a bit before actually being followed.  Any chance there's 
> something uninitialized somewhere?
>
> And of course, if all else fails, there's good old-fashioned memory 
> clobbering as a possibility...
>
>   -George

No more progress, I am afraid - other, more easily reproducible bugs came up.

I did identify another ticket in our system which had been mis-classified
and which appears to be the same or a similar race condition:

(XEN) Xen BUG at sched_credit.c:204
(XEN) ----[ Xen-4.1.3  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    11
(XEN) RIP:    e008:[<ffff82c480117bbb>] csched_vcpu_wake+0x3cb/0x3f0
(XEN) RFLAGS: 0000000000010003   CONTEXT: hypervisor
(XEN) rax: ffff830994b37438   rbx: ffff830994b36000   rcx: ffff830836ad1d60
(XEN) rdx: ffff830994b36000   rsi: 0000000000000000   rdi: ffff82c4802cb7c0
(XEN) rbp: ffff830a93343f20   rsp: ffff830836a97d80   r8:  ffff830a93343f20
(XEN) r9:  000000000000000b   r10: 000000000000000b   r11: ffff82c4802ba0e0
(XEN) r12: ffff830994b37500   r13: 000001b166257eba   r14: ffff82c4802cb7c0
(XEN) r15: ffff82c4802ba0e0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000a92721000   cr2: 0000000008ee7dc8
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830836a97d80:
(XEN)    ffff8300be6bf7d0 00000000fed000f0 ffff82c48014a130 0000000000000000
(XEN)    ffff830836a97f18 ffff830836a98010 ffff82c480149dda ffff830994b36000
(XEN)    ffff8300be6aa000 ffff830994b37500 000001b166257eba ffff82c4802cb7c0
(XEN)    ffff82c4802ba0e0 ffff82c4801206e0 0000000000000000 0000000000000286
(XEN)    000001b1320009f1 0000000000000003 0000000000000001 0000000000000000
(XEN)    ffff8300be6aa000 ffff8300be6ab790 ffff82c4801b7620 ffff8304501a7010
(XEN)    ffff8304501a7010 ffff82c48014f13d ffff8300be6ab790 ffff8300be6ab790
(XEN)    000001b166257de4 ffff82c4801b7642 ffff830836a9e100 ffff82c48012388c
(XEN)    ffff8300be6ab7d0 ffff830836a9e100 ffff8304501a7018 ffff82c480123d35
(XEN)    ffff82c4801af486 000000000000000b ffffffffffffffff ffff830836a97f18
(XEN)    ffff82c4802a8500 ffff82c4802ac500 000001b16625431d ffff82c4801216b5
(XEN)    ffff830836a97f18 ffff8300bf2f0000 ffff8300be74c000 ffff82c4802ba0e0
(XEN)    ffff830836a9e060 ffff82c48014fdc5 000000000000000b 000000000000008d
(XEN)    fffffffffffffffe ffffff007b411228 ffffff007b411200 fffffffeba4c1b90
(XEN)    0000000000000000 ffffffff8044b8c0 ffffff007b583e10 ffffff007b583e88
(XEN)    fffffffeba4c1d10 00000000000003e8 00000000000f4240 0000000000000000
(XEN)    0000000000000001 0000000000000000 0000006d00000000 ffffffff807371c6
(XEN)    0000000000000000 0000000000000246 fffffffeba4c1b90 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    000000000000000b ffff8300bf2f0000 00000043b67d2880 0000000000000000
(XEN) Xen call trace:
(XEN)       [<ffff82c480117bbb>] csched_vcpu_wake+0x3cb/0x3f0
(XEN)      2[<ffff82c48014a130>] smp_apic_timer_interrupt+0x50/0x90
(XEN)      6[<ffff82c480149dda>] apic_timer_interrupt+0x2a/0x30
(XEN)     13[<ffff82c4801206e0>] vcpu_wake+0x180/0x600
(XEN)     22[<ffff82c4801b7620>] pt_timer_fn+0x0/0x30
(XEN)     25[<ffff82c48014f13d>] vcpu_kick+0x1d/0x80
(XEN)     29[<ffff82c4801b7642>] pt_timer_fn+0x22/0x30
(XEN)     31[<ffff82c48012388c>] execute_timer+0x4c/0x70
(XEN)     35[<ffff82c480123d35>] timer_softirq_action+0x85/0x220
(XEN)     36[<ffff82c4801af486>] hvm_vcpu_has_pending_irq+0x76/0xd0
(XEN)     43[<ffff82c4801216b5>] __do_softirq+0x65/0x90
(XEN)     49[<ffff82c48014fdc5>] idle_loop+0x25/0x50
(XEN)   
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 11:
(XEN) Xen BUG at sched_credit.c:204
(XEN) ****************************************

The bug in question is BUG_ON( cpu != svc->vcpu->processor );

This is suspicious as, like before, there is an LAPIC-triggered interrupt
in the middle of a scheduling codepath.  Again, it has been seen only
once through multiple cycles of testing.

If you have any ideas then fantastic, but I have not forgotten about it
- just had other stuff come up.  I will get back to it at some point!

~Andrew


* Re: Race condition with scheduler runqueues
  2013-02-19  9:28 ` Jan Beulich
  2013-02-19 11:47   ` Andrew Cooper
@ 2013-02-27  6:19   ` Juergen Gross
  2013-02-27  6:28     ` Juergen Gross
  1 sibling, 1 reply; 8+ messages in thread
From: Juergen Gross @ 2013-02-27  6:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, Keir Fraser, Xen-devel List

On 19.02.2013 10:28, Jan Beulich wrote:
>>>> On 18.02.13 at 19:11, Andrew Cooper<andrew.cooper3@citrix.com>  wrote:
>> Hello,
>>
>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>> in sched_credit.c (because a static function of the same name also
>> exists in sched_credit2.c, which confused matters to start with)
>>
>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>> domain.  The test case itself has passed many times in the past on the
>> same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
>> does not appear to be any relevant changes between the version of Xen in
>> the test and xen-4.1-testing.
>>
>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>> from dom0, targeting the VCPU of the migrating domain.
>>
>> (XEN) Xen call trace:
>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>
>> The relevant part of csched_vcpu_sleep() is
>>
>>      else if ( __vcpu_on_runq(svc) )
>>          __runq_remove(svc);
>>
>> which disassembles to
>>
>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10
>> <csched_vcpu_sleep+0x40>
>> ffff82c480116a09:       f3 c3                   repz retq
>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
>> <- Pagefault here
>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>
>> The relevant crash registers from the pagefault are:
>> rax: 0000000000000000
>> rdx: 0000000000000000
>>   r8: ffff83080c89ed90
>>
>> If I am reading the code correctly, this means that runq->next is NULL,
>> so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
>> fail with a fault when trying to update runq->prev, which is also NULL.
>>
>> The only place I can spot in the code where the runq->{next,prev} could
>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>> localhost migrate loop, and I cant immediately see anything to prevent
>> this race happening.
>
> But that doesn't make sense: csched_alloc_vdata() doesn't store
> svc into vc->sched_priv; that's being done by the generic
> scheduler code once the actor returns.
>
> So I'd rather suspect a stale pointer being used, which is easily
> possible when racing with sched_move_domain() (as opposed to
> schedule_cpu_switch(), where the new pointer gets stored
> _before_ de-allocating the old one).
>
> However, sched_move_domain() (as well as schedule_cpu_switch())
> get called only from CPU pools code, and I would guess CPU pools
> aren't involved here, and you don't in parallel soft offline/online
> pCPU-s (as I'm sure you otherwise would have mentioned it).
>
> But wait - libxl__domain_make() _unconditionally_ calls
> xc_cpupool_movedomain(), as does XendDomainInfo's
> _constructDomain(). The reason for this escapes me - Jürgen? Yet
> I'd expect the pool ID matching check to short cut the resulting
> sysctl, i.e. sched_move_domain() ought to not be reached anyway
> (worth verifying of course).
>
> The race there nevertheless ought to be fixed.

Something like the attached patch?

Not tested thoroughly yet.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: movedom.patch --]
[-- Type: text/x-patch, Size: 1955 bytes --]

Avoid stale pointer when moving domain to another cpupool

When a domain is moved to another cpupool the scheduler private data pointers
in vcpu and domain structures must never point to an already freed memory
area.

Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff -r 1d8c65aee03e xen/common/schedule.c
--- a/xen/common/schedule.c     Tue Feb 26 10:12:46 2013 +0000
+++ b/xen/common/schedule.c     Wed Feb 27 06:50:51 2013 +0100
@@ -231,12 +231,14 @@ int sched_move_domain(struct domain *d, 
     unsigned int new_p;
     void **vcpu_priv;
     void *domdata;
+    struct scheduler *old_ops;
+    void *old_domdata;
 
     domdata = SCHED_OP(c->sched, alloc_domdata, d);
     if ( domdata == NULL )
         return -ENOMEM;
 
-    vcpu_priv = xzalloc_array(void *, d->max_vcpus);
+    vcpu_priv = xzalloc_array(void *, d->max_vcpus * 2);
     if ( vcpu_priv == NULL )
     {
         SCHED_OP(c->sched, free_domdata, domdata);
@@ -257,18 +259,13 @@ int sched_move_domain(struct domain *d, 
             SCHED_OP(c->sched, free_domdata, domdata);
             return -ENOMEM;
         }
+        vcpu_priv[d->max_vcpus + v->vcpu_id] = v->sched_priv;
     }
 
     domain_pause(d);
 
-    for_each_vcpu ( d, v )
-    {
-        SCHED_OP(VCPU2OP(v), remove_vcpu, v);
-        SCHED_OP(VCPU2OP(v), free_vdata, v->sched_priv);
-        v->sched_priv = NULL;
-    }
-
-    SCHED_OP(DOM2OP(d), free_domdata, d->sched_priv);
+    old_ops = DOM2OP(d);
+    old_domdata = d->sched_priv;
 
     d->cpupool = c;
     d->sched_priv = domdata;
@@ -289,6 +286,15 @@ int sched_move_domain(struct domain *d, 
 
         SCHED_OP(c->sched, insert_vcpu, v);
     }
+
+    for_each_vcpu ( d, v )
+    {
+        SCHED_OP(old_ops, remove_vcpu, v);
+        SCHED_OP(old_ops, free_vdata, vcpu_priv[d->max_vcpus + v->vcpu_id]);
+        v->sched_priv = NULL;
+    }
+
+    SCHED_OP(old_ops, free_domdata, old_domdata);
 
     domain_update_node_affinity(d);



* Re: Race condition with scheduler runqueues
  2013-02-27  6:19   ` Juergen Gross
@ 2013-02-27  6:28     ` Juergen Gross
  2013-02-27  9:12       ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Juergen Gross @ 2013-02-27  6:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, Keir Fraser, Xen-devel List

On 27.02.2013 07:19, Juergen Gross wrote:
> On 19.02.2013 10:28, Jan Beulich wrote:
>>>>> On 18.02.13 at 19:11, Andrew Cooper<andrew.cooper3@citrix.com> wrote:
>>> Hello,
>>>
>>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>>> in sched_credit.c (because a static function of the same name also
>>> exists in sched_credit2.c, which confused matters to start with)
>>>
>>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>>> domain. The test case itself has passed many times in the past on the
>>> same Xen codebase (Xen-4.1.3), indicating that it is very rare. There
>>> does not appear to be any relevant changes between the version of Xen in
>>> the test and xen-4.1-testing.
>>>
>>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>>> from dom0, targeting the VCPU of the migrating domain.
>>>
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>>> (XEN) 0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>>> (XEN) 12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>>> (XEN) 14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>>> (XEN) 24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>>> (XEN) 64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>>
>>> The relevant part of csched_vcpu_sleep() is
>>>
>>> else if ( __vcpu_on_runq(svc) )
>>> __runq_remove(svc);
>>>
>>> which disassembles to
>>>
>>> ffff82c480116a01: 49 8b 10 mov (%r8),%rdx
>>> ffff82c480116a04: 4c 39 c2 cmp %r8,%rdx
>>> ffff82c480116a07: 75 07 jne ffff82c480116a10
>>> <csched_vcpu_sleep+0x40>
>>> ffff82c480116a09: f3 c3 repz retq
>>> ffff82c480116a0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>>> ffff82c480116a10: 49 8b 40 08 mov 0x8(%r8),%rax
>>> ffff82c480116a14: 48 89 42 08 mov %rax,0x8(%rdx) #
>>> <- Pagefault here
>>> ffff82c480116a18: 48 89 10 mov %rdx,(%rax)
>>> ffff82c480116a1b: 4d 89 40 08 mov %r8,0x8(%r8)
>>> ffff82c480116a1f: 4d 89 00 mov %r8,(%r8)
>>>
>>> The relevant crash registers from the pagefault are:
>>> rax: 0000000000000000
>>> rdx: 0000000000000000
>>> r8: ffff83080c89ed90
>>>
>>> If I am reading the code correctly, this means that runq->next is NULL,
>>> so we fail list_empty() and erroneously pass __vcpu_on_runq(). We then
>>> fail with a fault when trying to update runq->prev, which is also NULL.
>>>
>>> The only place I can spot in the code where the runq->{next,prev} could
>>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>>> INIT_LIST_HEAD(). This is logically sensible in combination with the
>>> localhost migrate loop, and I cant immediately see anything to prevent
>>> this race happening.
>>
>> But that doesn't make sense: csched_alloc_vdata() doesn't store
>> svc into vc->sched_priv; that's being done by the generic
>> scheduler code once the actor returns.
>>
>> So I'd rather suspect a stale pointer being used, which is easily
>> possible when racing with sched_move_domain() (as opposed to
>> schedule_cpu_switch(), where the new pointer gets stored
>> _before_ de-allocating the old one).
>>
>> However, sched_move_domain() (as well as schedule_cpu_switch())
>> get called only from CPU pools code, and I would guess CPU pools
>> aren't involved here, and you don't in parallel soft offline/online
>> pCPU-s (as I'm sure you otherwise would have mentioned it).
>>
>> But wait - libxl__domain_make() _unconditionally_ calls
>> xc_cpupool_movedomain(), as does XendDomainInfo's
>> _constructDomain(). The reason for this escapes me - Jürgen? Yet
>> I'd expect the pool ID matching check to short cut the resulting
>> sysctl, i.e. sched_move_domain() ought to not be reached anyway
>> (worth verifying of course).
>>
>> The race there nevertheless ought to be fixed.
>
> Something like the attached patch?
>
> Not tested thoroughly yet.

Argh. Sent an old version, sorry. This one should be better.

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

[-- Attachment #2: movedom.patch --]
[-- Type: text/x-patch, Size: 1882 bytes --]

Avoid stale pointer when moving domain to another cpupool

When a domain is moved to another cpupool the scheduler private data pointers
in vcpu and domain structures must never point to an already freed memory
area.

Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff -r 1d8c65aee03e xen/common/schedule.c
--- a/xen/common/schedule.c     Tue Feb 26 10:12:46 2013 +0000
+++ b/xen/common/schedule.c     Wed Feb 27 07:16:05 2013 +0100
@@ -231,12 +231,14 @@ int sched_move_domain(struct domain *d, 
     unsigned int new_p;
     void **vcpu_priv;
     void *domdata;
+    struct scheduler *old_ops;
+    void *old_domdata;
 
     domdata = SCHED_OP(c->sched, alloc_domdata, d);
     if ( domdata == NULL )
         return -ENOMEM;
 
-    vcpu_priv = xzalloc_array(void *, d->max_vcpus);
+    vcpu_priv = xzalloc_array(void *, d->max_vcpus * 2);
     if ( vcpu_priv == NULL )
     {
         SCHED_OP(c->sched, free_domdata, domdata);
@@ -257,18 +259,18 @@ int sched_move_domain(struct domain *d, 
             SCHED_OP(c->sched, free_domdata, domdata);
             return -ENOMEM;
         }
+        vcpu_priv[d->max_vcpus + v->vcpu_id] = v->sched_priv;
     }
 
     domain_pause(d);
 
+    old_ops = DOM2OP(d);
+    old_domdata = d->sched_priv;
+
     for_each_vcpu ( d, v )
     {
         SCHED_OP(VCPU2OP(v), remove_vcpu, v);
-        SCHED_OP(VCPU2OP(v), free_vdata, v->sched_priv);
-        v->sched_priv = NULL;
     }
-
-    SCHED_OP(DOM2OP(d), free_domdata, d->sched_priv);
 
     d->cpupool = c;
     d->sched_priv = domdata;
@@ -289,6 +291,13 @@ int sched_move_domain(struct domain *d, 
 
         SCHED_OP(c->sched, insert_vcpu, v);
     }
+
+    for_each_vcpu ( d, v )
+    {
+        SCHED_OP(old_ops, free_vdata, vcpu_priv[d->max_vcpus + v->vcpu_id]);
+    }
+
+    SCHED_OP(old_ops, free_domdata, old_domdata);
 
     domain_update_node_affinity(d);



* Re: Race condition with scheduler runqueues
  2013-02-27  6:28     ` Juergen Gross
@ 2013-02-27  9:12       ` Jan Beulich
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Beulich @ 2013-02-27  9:12 UTC (permalink / raw)
  To: Juergen Gross; +Cc: George Dunlap, Andrew Cooper, Keir Fraser, Xen-devel List

>>> On 27.02.13 at 07:28, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> On 27.02.2013 07:19, Juergen Gross wrote:
>> On 19.02.2013 10:28, Jan Beulich wrote:
>>>>>> On 18.02.13 at 19:11, Andrew Cooper<andrew.cooper3@citrix.com> wrote:
>>>> Hello,
>>>>
>>>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>>>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>>>> in sched_credit.c (because a static function of the same name also
>>>> exists in sched_credit2.c, which confused matters to start with)
>>>>
>>>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>>>> domain. The test case itself has passed many times in the past on the
>>>> same Xen codebase (Xen-4.1.3), indicating that it is very rare. There
>>>> does not appear to be any relevant changes between the version of Xen in
>>>> the test and xen-4.1-testing.
>>>>
>>>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>>>> from dom0, targeting the VCPU of the migrating domain.
>>>>
>>>> (XEN) Xen call trace:
>>>> (XEN) [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>>>> (XEN) 0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>>>> (XEN) 12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>>>> (XEN) 14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>>>> (XEN) 24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>>>> (XEN) 64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>>>
>>>> The relevant part of csched_vcpu_sleep() is
>>>>
>>>> else if ( __vcpu_on_runq(svc) )
>>>> __runq_remove(svc);
>>>>
>>>> which disassembles to
>>>>
>>>> ffff82c480116a01: 49 8b 10 mov (%r8),%rdx
>>>> ffff82c480116a04: 4c 39 c2 cmp %r8,%rdx
>>>> ffff82c480116a07: 75 07 jne ffff82c480116a10
>>>> <csched_vcpu_sleep+0x40>
>>>> ffff82c480116a09: f3 c3 repz retq
>>>> ffff82c480116a0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>>>> ffff82c480116a10: 49 8b 40 08 mov 0x8(%r8),%rax
>>>> ffff82c480116a14: 48 89 42 08 mov %rax,0x8(%rdx) #
>>>> <- Pagefault here
>>>> ffff82c480116a18: 48 89 10 mov %rdx,(%rax)
>>>> ffff82c480116a1b: 4d 89 40 08 mov %r8,0x8(%r8)
>>>> ffff82c480116a1f: 4d 89 00 mov %r8,(%r8)
>>>>
>>>> The relevant crash registers from the pagefault are:
>>>> rax: 0000000000000000
>>>> rdx: 0000000000000000
>>>> r8: ffff83080c89ed90
>>>>
>>>> If I am reading the code correctly, this means that runq->next is NULL,
>>>> so we fail list_empty() and erroneously pass __vcpu_on_runq(). We then
>>>> fail with a fault when trying to update runq->prev, which is also NULL.
>>>>
>>>> The only place I can spot in the code where the runq->{next,prev} could
>>>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>>>> INIT_LIST_HEAD(). This is logically sensible in combination with the
>>>> localhost migrate loop, and I cant immediately see anything to prevent
>>>> this race happening.
>>>
>>> But that doesn't make sense: csched_alloc_vdata() doesn't store
>>> svc into vc->sched_priv; that's being done by the generic
>>> scheduler code once the actor returns.
>>>
>>> So I'd rather suspect a stale pointer being used, which is easily
>>> possible when racing with sched_move_domain() (as opposed to
>>> schedule_cpu_switch(), where the new pointer gets stored
>>> _before_ de-allocating the old one).
>>>
>>> However, sched_move_domain() (as well as schedule_cpu_switch())
>>> get called only from CPU pools code, and I would guess CPU pools
>>> aren't involved here, and you don't in parallel soft offline/online
>>> pCPU-s (as I'm sure you otherwise would have mentioned it).
>>>
>>> But wait - libxl__domain_make() _unconditionally_ calls
>>> xc_cpupool_movedomain(), as does XendDomainInfo's
>>> _constructDomain(). The reason for this escapes me - Jürgen? Yet
>>> I'd expect the pool ID matching check to short cut the resulting
>>> sysctl, i.e. sched_move_domain() ought to not be reached anyway
>>> (worth verifying of course).
>>>
>>> The race there nevertheless ought to be fixed.
>>
>> Something like the attached patch?
>>
>> Not tested thoroughly yet.
> 
> Argh. Sent an old version, sorry. This one should be better.

Sort of.  I was under the impression that storing just a single
"old" pointer temporarily inside the last loop and moving the
free_vdata invocation to the end of that same loop would take
care of it equally well, without increasing the storage
requirements (and hence also the risk of failure), and without
yet another for_each_vcpu() loop.

Or, if going your route, you would want to move the final freeing
operations past the domain_unpause() I think (which could be
done even with the approach suggested above, by putting the
to-be-freed pointer back into the same vcpu_priv[] slot).
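
For illustration, the single-temporary variant described above might look
roughly like this inside the final for_each_vcpu() loop (a sketch only,
with the unrelated parts of the existing loop body elided - not the
actual patch):

    for_each_vcpu ( d, v )
    {
        void *old_vdata = v->sched_priv;   /* the single temporary "old" pointer */

        /* ... existing body: migrate timers, set v->processor = new_p,
         *     v->sched_priv = vcpu_priv[v->vcpu_id], etc. ... */

        SCHED_OP(c->sched, insert_vcpu, v);

        /* The old data is freed only after the new pointer has been
         * published (remove_vcpu having already run in the earlier loop). */
        SCHED_OP(old_ops, free_vdata, old_vdata);
    }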

And then, if you already have to cache DOM2OP() in a local
variable, you should of course convert the remaining use of
VCPU2OP() to use the cached copy (the two uses of the latter
were bogus anyway, as the simpler DOM2OP() would have
done; similarly in sched_init_vcpu()).

Jan

