* [PATCH] Avoid race when moving cpu between cpupools
@ 2011-02-24 10:00 Juergen Gross
  2011-02-24 14:08 ` Andre Przywara
  0 siblings, 1 reply; 7+ messages in thread
From: Juergen Gross @ 2011-02-24 10:00 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 498 bytes --]

Moving cpus between cpupools is done under the schedule lock of the moved cpu.
When checking whether a cpu is a member of a cpupool, this must be done with
that cpu's schedule lock held.
Hot-unplugging of physical cpus might encounter the same problems, but this
should happen only very rarely.

Signed-off-by: juergen.gross@ts.fujitsu.com


2 files changed, 35 insertions(+), 7 deletions(-)
xen/common/sched_credit.c |    3 ++-
xen/common/schedule.c     |   39 +++++++++++++++++++++++++++++++++------



[-- Attachment #2: xen-work.patch --]
[-- Type: text/x-patch, Size: 3367 bytes --]

# HG changeset patch
# User Juergen Gross <juergen.gross@ts.fujitsu.com>
# Date 1298541607 -3600
# Node ID 5485071c8b0a6a49f65b7cc47841f7d14b358247
# Parent  c0a46434347b265fc8c45e9be3adc41b43f4f682
Avoid race when moving cpu between cpupools

Moving cpus between cpupools is done under the schedule lock of the moved cpu.
When checking whether a cpu is a member of a cpupool, this must be done with
that cpu's schedule lock held.
Hot-unplugging of physical cpus might encounter the same problems, but this
should happen only very rarely.

Signed-off-by: juergen.gross@ts.fujitsu.com

diff -r c0a46434347b -r 5485071c8b0a xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed Feb 16 18:23:48 2011 +0000
+++ b/xen/common/sched_credit.c	Thu Feb 24 11:00:07 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {
diff -r c0a46434347b -r 5485071c8b0a xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Feb 16 18:23:48 2011 +0000
+++ b/xen/common/schedule.c	Thu Feb 24 11:00:07 2011 +0100
@@ -394,8 +394,32 @@ static void vcpu_migrate(struct vcpu *v)
 {
     unsigned long flags;
     int old_cpu, new_cpu;
+    int same_lock;
 
-    vcpu_schedule_lock_irqsave(v, flags);
+    for (;;)
+    {
+        vcpu_schedule_lock_irqsave(v, flags);
+
+        /* Select new CPU. */
+        old_cpu = v->processor;
+        new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
+        same_lock = (per_cpu(schedule_data, new_cpu).schedule_lock ==
+                     per_cpu(schedule_data, old_cpu).schedule_lock);
+
+        if ( same_lock )
+            break;
+
+        if ( !pcpu_schedule_trylock(new_cpu) )
+        {
+            vcpu_schedule_unlock_irqrestore(v, flags);
+            continue;
+        }
+        if ( cpu_isset(new_cpu, v->domain->cpupool->cpu_valid) )
+            break;
+
+        pcpu_schedule_unlock(new_cpu);
+        vcpu_schedule_unlock_irqrestore(v, flags);
+    }
 
     /*
      * NB. Check of v->running happens /after/ setting migration flag
@@ -405,13 +429,12 @@ static void vcpu_migrate(struct vcpu *v)
     if ( v->is_running ||
          !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )
     {
+        if ( !same_lock )
+            pcpu_schedule_unlock(new_cpu);
+
         vcpu_schedule_unlock_irqrestore(v, flags);
         return;
     }
-
-    /* Select new CPU. */
-    old_cpu = v->processor;
-    new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
 
     /*
      * Transfer urgency status to new CPU before switching CPUs, as once
@@ -424,9 +447,13 @@ static void vcpu_migrate(struct vcpu *v)
         atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
     }
 
-    /* Switch to new CPU, then unlock old CPU.  This is safe because
+    /* Switch to new CPU, then unlock new and old CPU.  This is safe because
      * the lock pointer cant' change while the current lock is held. */
     v->processor = new_cpu;
+
+    if ( !same_lock )
+        pcpu_schedule_unlock(new_cpu);
+
     spin_unlock_irqrestore(
         per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: [PATCH] Avoid race when moving cpu between cpupools
  2011-02-24 10:00 [PATCH] Avoid race when moving cpu between cpupools Juergen Gross
@ 2011-02-24 14:08 ` Andre Przywara
  2011-02-24 14:33   ` George Dunlap
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Przywara @ 2011-02-24 14:08 UTC (permalink / raw)
  To: Juergen Gross, Keir Fraser; +Cc: Dunlap, xen-devel

Juergen Gross wrote:
> Moving cpus between cpupools is done under the schedule lock of the moved cpu.
> When checking whether a cpu is a member of a cpupool, this must be done with
> that cpu's schedule lock held.

I have reviewed and tested the patch. It fixes my problem. My script has 
been running for several hundred iterations without any Xen crash, 
whereas without the patch the hypervisor crashed mostly at the second 
iteration.

Thanks Juergen and George for the persistent work!

> Hot-unplugging of physical cpus might encounter the same problems, but this
> should happen only very rarely.
> 
> Signed-off-by: juergen.gross@ts.fujitsu.com

Acked-by: Andre Przywara <andre.przywara@amd.com>

Keir, please apply for 4.1.0.


Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany


* Re: [PATCH] Avoid race when moving cpu between cpupools
  2011-02-24 14:08 ` Andre Przywara
@ 2011-02-24 14:33   ` George Dunlap
  2011-02-25 14:25     ` Andre Przywara
  0 siblings, 1 reply; 7+ messages in thread
From: George Dunlap @ 2011-02-24 14:33 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Juergen Gross, xen-devel, Keir Fraser, Diestelhorst, Stephan

Looks good -- thanks Juergen.

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

 -George

On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara <andre.przywara@amd.com> wrote:
> Juergen Gross wrote:
>>
>> Moving cpus between cpupools is done under the schedule lock of the moved
>> cpu.
>> When checking whether a cpu is a member of a cpupool, this must be done
>> with that cpu's schedule lock held.
>
> I have reviewed and tested the patch. It fixes my problem. My script has
> been running for several hundred iterations without any Xen crash, whereas
> without the patch the hypervisor crashed mostly at the second iteration.
>
> Thanks Juergen and George for the persistent work!
>
>> Hot-unplugging of physical cpus might encounter the same problems, but
>> this
>> should happen only very rarely.
>>
>> Signed-off-by: juergen.gross@ts.fujitsu.com
>
> Acked-by: Andre Przywara <andre.przywara@amd.com>
>
> Keir, please apply for 4.1.0.
>
>
> Regards,
> Andre.
>
> --
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>


* Re: [PATCH] Avoid race when moving cpu between cpupools
  2011-02-24 14:33   ` George Dunlap
@ 2011-02-25 14:25     ` Andre Przywara
  2011-02-25 14:36       ` Keir Fraser
  2011-02-28  9:29       ` Juergen Gross
  0 siblings, 2 replies; 7+ messages in thread
From: Andre Przywara @ 2011-02-25 14:25 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, xen-devel, Keir Fraser, Diestelhorst, Stephan

George Dunlap wrote:
> Looks good -- thanks Juergen.
> 
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
>  -George
> 
> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara <andre.przywara@amd.com> wrote:
>> Juergen Gross wrote:
>>> Moving cpus between cpupools is done under the schedule lock of the moved
>>> cpu.
>>> When checking a cpu being member of a cpupool this must be done with the
>>> lock
>>> of that cpu being held.
>> I have reviewed and tested the patch. It fixes my problem. My script has
>> been running for several hundred iterations without any Xen crash, whereas
>> without the patch the hypervisor crashed mostly at the second iteration.

Juergen,

can you rule out that this code will be triggered on two CPUs trying to 
switch to each other? As Stephan pointed out, the code looks as if it
could trigger a possible dead-lock condition, where:
1) CPU A grabs lock (a) while CPU B grabs lock (b)
2) CPU A tries to grab (b) and CPU B tries to grab (a)
3) both fail and loop to 1)
A possible fix would be to introduce some ordering for the locks (just 
the pointer address) and let the "bigger" pointer yield to the "smaller" 
one. I am not sure if this is really necessary, but I now see strange 
hangs after running the script for a while (30min to 1hr).
Sometimes Dom0 hangs for a while, losing interrupts (sda or eth0) or 
getting spurious ones, on two occasions the machine totally locked up.

I am not 100% sure whether this is CPUpools related, but I put some load 
on Dom0 (without messing with CPUpools) for the whole night and it ran fine.

Sorry for this :-(
I will try to further isolate this.

Anyway, it works much better with the fix than without and I will try to 
trigger this with the "reduce number of Dom0 vCPUs" patch.

Regards,
Andre.

>>
>> Thanks Juergen and George for the persistent work!
>>
>>> Hot-unplugging of physical cpus might encounter the same problems, but
>>> this
>>> should happen only very rarely.
>>>
>>> Signed-off-by: juergen.gross@ts.fujitsu.com
>> Acked-by: Andre Przywara <andre.przywara@amd.com>
>>
>> Keir, please apply for 4.1.0.
>>


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: [PATCH] Avoid race when moving cpu between cpupools
  2011-02-25 14:25     ` Andre Przywara
@ 2011-02-25 14:36       ` Keir Fraser
  2011-02-28  9:29       ` Juergen Gross
  1 sibling, 0 replies; 7+ messages in thread
From: Keir Fraser @ 2011-02-25 14:36 UTC (permalink / raw)
  To: Andre Przywara, Juergen Gross
  Cc: George Dunlap, xen-devel, Diestelhorst, Stephan

On 25/02/2011 14:25, "Andre Przywara" <andre.przywara@amd.com> wrote:

> can you rule out that this code will be triggered on two CPUs trying to
> switch to each other? As Stephan pointed out, the code looks as if it
> could trigger a possible dead-lock condition, where:
> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
> 3) both fail and loop to 1)
> A possible fix would be to introduce some ordering for the locks (just
> the pointer address) and let the "bigger" pointer yield to the "smaller"
> one. I am not sure if this is really necessary, but I now see strange
> hangs after running the script for a while (30min to 1hr).
> Sometimes Dom0 hangs for a while, losing interrupts (sda or eth0) or
> getting spurious ones, on two occasions the machine totally locked up.

In other places in Xen where we take a pair of locks with no other implicit
ordering, we enforce an ordering based on lock addresses. See
common/timer.c:migrate_timer() for example. I'm sure there must be at least
one example of this in the scheduling code already, with vcpus migrating
between cpus and needing both runqueue locks.

 -- Keir


* Re: [PATCH] Avoid race when moving cpu between cpupools
  2011-02-25 14:25     ` Andre Przywara
  2011-02-25 14:36       ` Keir Fraser
@ 2011-02-28  9:29       ` Juergen Gross
  2011-02-28 10:00         ` Andre Przywara
  1 sibling, 1 reply; 7+ messages in thread
From: Juergen Gross @ 2011-02-28  9:29 UTC (permalink / raw)
  To: Andre Przywara
  Cc: George Dunlap, xen-devel, Keir Fraser, Diestelhorst, Stephan

On 02/25/11 15:25, Andre Przywara wrote:
> George Dunlap wrote:
>> Looks good -- thanks Juergen.
>>
>> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>>
>> -George
>>
>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara
>> <andre.przywara@amd.com> wrote:
>>> Juergen Gross wrote:
>>>> Moving cpus between cpupools is done under the schedule lock of the
>>>> moved
>>>> cpu.
>>>> When checking whether a cpu is a member of a cpupool, this must be
>>>> done with that cpu's schedule lock held.
>>> I have reviewed and tested the patch. It fixes my problem. My script has
>>> been running for several hundred iterations without any Xen crash,
>>> whereas
>>> without the patch the hypervisor crashed mostly at the second iteration.
>
> Juergen,
>
> can you rule out that this code will be triggered on two CPUs trying to
> switch to each other? As Stephan pointed out, the code looks as if it
> could trigger a possible dead-lock condition, where:
> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
> 3) both fail and loop to 1)

Good point. Not quite a dead-lock, but a possible live-lock :-)

> A possible fix would be to introduce some ordering for the locks (just
> the pointer address) and let the "bigger" pointer yield to the "smaller"
> one.

Done this and sent a patch.

> I am not sure if this is really necessary, but I now see strange
> hangs after running the script for a while (30min to 1hr).
> Sometimes Dom0 hangs for a while, losing interrupts (sda or eth0) or
> getting spurious ones, on two occasions the machine totally locked up.
>
> I am not 100% sure whether this is CPUpools related, but I put some load
> on Dom0 (without messing with CPUpools) for the whole night and it ran
> fine.

Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus?
I had the same problems when using only a few physical cpus for many vcpus.
And I'm pretty sure this was NOT the possible live-lock, as it happened
already without this change when I tried to reproduce your problem.

>
> Sorry for this :-(
> I will try to further isolate this.
>
> Anyway, it works much better with the fix than without and I will try to
> trigger this with the "reduce number of Dom0 vCPUs" patch.


Thanks, Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: [PATCH] Avoid race when moving cpu between cpupools
  2011-02-28  9:29       ` Juergen Gross
@ 2011-02-28 10:00         ` Andre Przywara
  0 siblings, 0 replies; 7+ messages in thread
From: Andre Przywara @ 2011-02-28 10:00 UTC (permalink / raw)
  To: Juergen Gross
  Cc: George Dunlap, xen-devel, Keir Fraser, Diestelhorst, Stephan

Juergen Gross wrote:
> On 02/25/11 15:25, Andre Przywara wrote:
>> George Dunlap wrote:
>>> Looks good -- thanks Juergen.
>>>
>>> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>>>
>>> -George
>>>
>>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara
>>> <andre.przywara@amd.com> wrote:
>>>> Juergen Gross wrote:
>>>>> Moving cpus between cpupools is done under the schedule lock of the
>>>>> moved
>>>>> cpu.
>>>>> When checking whether a cpu is a member of a cpupool, this must be
>>>>> done with that cpu's schedule lock held.
>>>> I have reviewed and tested the patch. It fixes my problem. My script has
>>>> been running for several hundred iterations without any Xen crash,
>>>> whereas
>>>> without the patch the hypervisor crashed mostly at the second iteration.
>> Juergen,
>>
>> can you rule out that this code will be triggered on two CPUs trying to
>> switch to each other? As Stephan pointed out, the code looks as if it
>> could trigger a possible dead-lock condition, where:
>> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
>> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
>> 3) both fail and loop to 1)
> 
> Good point. Not quite a dead-lock, but a possible live-lock :-)

Yeah, sorry. That was the wrong wording. Kudos go to Stephan for 
pointing this out.
> 
>> A possible fix would be to introduce some ordering for the locks (just
>> the pointer address) and let the "bigger" pointer yield to the "smaller"
>> one.
> 
> Done this and sent a patch.

Thanks, it looks good on the first glance. Not yet tested, though.

> 
>> I am not sure if this is really necessary, but I now see strange
>> hangs after running the script for a while (30min to 1hr).
>> Sometimes Dom0 hangs for a while, losing interrupts (sda or eth0) or
>> getting spurious ones, on two occasions the machine totally locked up.
>>
>> I am not 100% sure whether this is CPUpools related, but I put some load
>> on Dom0 (without messing with CPUpools) for the whole night and it ran
>> fine.
> 
> Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus?
> I had the same problems when using only a few physical cpus for many vcpus.
> And I'm pretty sure this was NOT the possible live-lock, as it happened
> already without this change when I tried to reproduce your problem.

That is my current theory, too. I inserted counters in the try-loop for 
the locks to detect possible lock-ups, but they never went over 99, so 
this is not the reason.
The high overcommit (48 vCPUs on 6 pCPUs) is probably responsible. The 
new reduction of Dom0 vCPUs should avoid this situation in the future.

> 
>> Sorry for this :-(
>> I will try to further isolate this.
>>
>> Anyway, it works much better with the fix than without and I will try to
>> trigger this with the "reduce number of Dom0 vCPUs" patch.

Unfortunately I got a Dom0 crash with the new patch. Reverting 22934 
worked fine. I will investigate this now.

root@dosorca:/data/images# xl cpupool-numa-split
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany



Thread overview: 7+ messages
2011-02-24 10:00 [PATCH] Avoid race when moving cpu between cpupools Juergen Gross
2011-02-24 14:08 ` Andre Przywara
2011-02-24 14:33   ` George Dunlap
2011-02-25 14:25     ` Andre Przywara
2011-02-25 14:36       ` Keir Fraser
2011-02-28  9:29       ` Juergen Gross
2011-02-28 10:00         ` Andre Przywara
