All of lore.kernel.org
* [PATCH v2 0/6] xen: simplify suspend/resume handling
@ 2019-03-28 12:06 Juergen Gross
  2019-03-28 12:06 ` [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier Juergen Gross
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich, Roger Pau Monné

Suspend and resume currently involve rather complex handling, especially
in the scheduler area (schedule.c, cpupool.c).

This can be simplified a lot by not performing a complete cpu down/up
cycle for the non-boot cpus, but instead keeping the purely
software-related state and freeing it only in case a cpu didn't come up
again during resume.

In summary, not only is the complexity reduced, but the failure
tolerance is even better with this series: with a dedicated hook for
cpus failing to come up when resuming, it is now possible to survive
e.g. a cpupool being left without any cpu after resume, by moving its
domains to cpupool0.

Juergen Gross (6):
  xen/sched: call cpu_disable_scheduler() via cpu notifier
  xen: add helper for calling notifier_call_chain() to common/cpu.c
  xen: add new cpu notifier action CPU_RESUME_FAILED
  xen: don't free percpu areas during suspend
  xen/cpupool: simplify suspend/resume handling
  xen/sched: don't disable scheduler on cpus during suspend

 xen/arch/arm/smpboot.c     |   4 -
 xen/arch/x86/percpu.c      |   3 +-
 xen/arch/x86/smpboot.c     |   3 -
 xen/common/cpu.c           |  61 +++++++-------
 xen/common/cpupool.c       | 131 ++++++++++++-----------------
 xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
 xen/include/xen/cpu.h      |  29 ++++---
 xen/include/xen/sched-if.h |   1 -
 8 files changed, 190 insertions(+), 245 deletions(-)

-- 
2.16.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
@ 2019-03-28 12:06 ` Juergen Gross
  2019-03-29 17:19   ` Dario Faggioli
  2019-04-01 10:34   ` Julien Grall
  2019-03-28 12:06 ` [PATCH v2 2/6] xen: add helper for calling notifier_call_chain() to common/cpu.c Juergen Gross
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, George Dunlap,
	Andrew Cooper, Dario Faggioli, Julien Grall, Jan Beulich,
	Roger Pau Monné

cpu_disable_scheduler() is being called from __cpu_disable() today.
There is no need to execute it on the cpu just being disabled, so use
the CPU_DEAD case of the cpu notifier chain. Moving the call out of
stop_machine() context is fine, as we just need to hold the domain RCU
lock and need the scheduler percpu data to still be allocated.

Add another hook for CPU_DOWN_PREPARE to bail out early in case
cpu_disable_scheduler() would fail. This avoids crashes in rare cpu
hotplug or suspend cases.

While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
incarnation.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- add CPU_DOWN_PREPARE hook
- BUG() in case of cpu_disable_scheduler() failing in CPU_DEAD
  (Jan Beulich)
- modify ARM __cpu_disable(), too (Andrew Cooper)
---
 xen/arch/arm/smpboot.c |  4 ----
 xen/arch/x86/smpboot.c |  3 ---
 xen/common/schedule.c  | 42 +++++++++++++++++++++++++++++++++++-------
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 25cd44549c..0728a9b505 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -386,10 +386,6 @@ void __cpu_disable(void)
     /* It's now safe to remove this processor from the online map */
     cpumask_clear_cpu(cpu, &cpu_online_map);
 
-    if ( cpu_disable_scheduler(cpu) )
-        BUG();
-    smp_mb();
-
     /* Return to caller; eventually the IPI mechanism will unwind and the 
      * scheduler will drop to the idle loop, which will call stop_cpu(). */
 }
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7d1226d7bc..b7a0a4a419 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1221,9 +1221,6 @@ void __cpu_disable(void)
     cpumask_clear_cpu(cpu, &cpu_online_map);
     fixup_irqs(&cpu_online_map, 1);
     fixup_eoi();
-
-    if ( cpu_disable_scheduler(cpu) )
-        BUG();
 }
 
 void __cpu_die(unsigned int cpu)
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 60755a631e..5d2bbd5198 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -773,8 +773,9 @@ void restore_vcpu_affinity(struct domain *d)
 }
 
 /*
- * This function is used by cpu_hotplug code from stop_machine context
+ * This function is used by cpu_hotplug code via cpu notifier chain
  * and from cpupools to switch schedulers on a cpu.
+ * The caller must hold the domain RCU lock (domlist_read_lock).
  */
 int cpu_disable_scheduler(unsigned int cpu)
 {
@@ -789,12 +790,6 @@ int cpu_disable_scheduler(unsigned int cpu)
     if ( c == NULL )
         return ret;
 
-    /*
-     * We'd need the domain RCU lock, but:
-     *  - when we are called from cpupool code, it's acquired there already;
-     *  - when we are called for CPU teardown, we're in stop-machine context,
-     *    so that's not be a problem.
-     */
     for_each_domain_in_cpupool ( d, c )
     {
         for_each_vcpu ( d, v )
@@ -893,6 +888,30 @@ int cpu_disable_scheduler(unsigned int cpu)
     return ret;
 }
 
+static int cpu_disable_scheduler_check(unsigned int cpu)
+{
+    struct domain *d;
+    struct vcpu *v;
+    struct cpupool *c;
+
+    c = per_cpu(cpupool, cpu);
+    if ( c == NULL )
+        return 0;
+
+    for_each_domain_in_cpupool ( d, c )
+    {
+        for_each_vcpu ( d, v )
+        {
+            if ( v->affinity_broken )
+                return -EADDRINUSE;
+            if ( system_state != SYS_STATE_suspend && v->processor == cpu )
+                return -EAGAIN;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * In general, this must be called with the scheduler lock held, because the
  * adjust_affinity hook may want to modify the vCPU state. However, when the
@@ -1737,7 +1756,16 @@ static int cpu_schedule_callback(
     case CPU_UP_PREPARE:
         rc = cpu_schedule_up(cpu);
         break;
+    case CPU_DOWN_PREPARE:
+        rcu_read_lock(&domlist_read_lock);
+        rc = cpu_disable_scheduler_check(cpu);
+        rcu_read_unlock(&domlist_read_lock);
+        break;
     case CPU_DEAD:
+        rcu_read_lock(&domlist_read_lock);
+        rc = cpu_disable_scheduler(cpu);
+        BUG_ON(rc);
+        rcu_read_unlock(&domlist_read_lock);
         SCHED_OP(sched, deinit_pdata, sd->sched_priv, cpu);
         /* Fallthrough */
     case CPU_UP_CANCELED:
-- 
2.16.4




* [PATCH v2 2/6] xen: add helper for calling notifier_call_chain() to common/cpu.c
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
  2019-03-28 12:06 ` [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier Juergen Gross
@ 2019-03-28 12:06 ` Juergen Gross
  2019-03-29 17:33   ` Dario Faggioli
  2019-03-28 12:06 ` [PATCH v2 3/6] xen: add new cpu notifier action CPU_RESUME_FAILED Juergen Gross
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich

Add a helper cpu_notifier_call_chain() to call notifier_call_chain()
for a cpu with a specified action, returning an errno value.

This avoids coding the same pattern multiple times.

While at it, avoid potential side effects of the BUG_ON() macro by not
passing cpu_online(cpu) as its argument.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- add nofail parameter to cpu_notifier_call_chain()
- avoid side effects from using BUG_ON() macro (Andrew Cooper)
---
 xen/common/cpu.c | 56 ++++++++++++++++++++++++++------------------------------
 1 file changed, 26 insertions(+), 30 deletions(-)

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 836c62f97f..8bf69600a6 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -71,11 +71,21 @@ void __init register_cpu_notifier(struct notifier_block *nb)
     spin_unlock(&cpu_add_remove_lock);
 }
 
+static int cpu_notifier_call_chain(unsigned int cpu, unsigned long action,
+                                   struct notifier_block **nb, bool nofail)
+{
+    void *hcpu = (void *)(long)cpu;
+    int notifier_rc = notifier_call_chain(&cpu_chain, action, hcpu, nb);
+    int ret = (notifier_rc == NOTIFY_DONE) ? 0 : notifier_to_errno(notifier_rc);
+
+    BUG_ON(ret && nofail);
+
+    return ret;
+}
+
 static void _take_cpu_down(void *unused)
 {
-    void *hcpu = (void *)(long)smp_processor_id();
-    int notifier_rc = notifier_call_chain(&cpu_chain, CPU_DYING, hcpu, NULL);
-    BUG_ON(notifier_rc != NOTIFY_DONE);
+    cpu_notifier_call_chain(smp_processor_id(), CPU_DYING, NULL, true);
     __cpu_disable();
 }
 
@@ -87,8 +97,7 @@ static int take_cpu_down(void *arg)
 
 int cpu_down(unsigned int cpu)
 {
-    int err, notifier_rc;
-    void *hcpu = (void *)(long)cpu;
+    int err;
     struct notifier_block *nb = NULL;
 
     if ( !cpu_hotplug_begin() )
@@ -100,12 +109,9 @@ int cpu_down(unsigned int cpu)
         return -EINVAL;
     }
 
-    notifier_rc = notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE, hcpu, &nb);
-    if ( notifier_rc != NOTIFY_DONE )
-    {
-        err = notifier_to_errno(notifier_rc);
+    err = cpu_notifier_call_chain(cpu, CPU_DOWN_PREPARE, &nb, false);
+    if ( err )
         goto fail;
-    }
 
     if ( unlikely(system_state < SYS_STATE_active) )
         on_selected_cpus(cpumask_of(cpu), _take_cpu_down, NULL, true);
@@ -113,26 +119,24 @@ int cpu_down(unsigned int cpu)
         goto fail;
 
     __cpu_die(cpu);
-    BUG_ON(cpu_online(cpu));
+    err = cpu_online(cpu);
+    BUG_ON(err);
 
-    notifier_rc = notifier_call_chain(&cpu_chain, CPU_DEAD, hcpu, NULL);
-    BUG_ON(notifier_rc != NOTIFY_DONE);
+    cpu_notifier_call_chain(cpu, CPU_DEAD, NULL, true);
 
     send_global_virq(VIRQ_PCPU_STATE);
     cpu_hotplug_done();
     return 0;
 
  fail:
-    notifier_rc = notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED, hcpu, &nb);
-    BUG_ON(notifier_rc != NOTIFY_DONE);
+    cpu_notifier_call_chain(cpu, CPU_DOWN_FAILED, &nb, true);
     cpu_hotplug_done();
     return err;
 }
 
 int cpu_up(unsigned int cpu)
 {
-    int notifier_rc, err = 0;
-    void *hcpu = (void *)(long)cpu;
+    int err;
     struct notifier_block *nb = NULL;
 
     if ( !cpu_hotplug_begin() )
@@ -144,19 +148,15 @@ int cpu_up(unsigned int cpu)
         return -EINVAL;
     }
 
-    notifier_rc = notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu, &nb);
-    if ( notifier_rc != NOTIFY_DONE )
-    {
-        err = notifier_to_errno(notifier_rc);
+    err = cpu_notifier_call_chain(cpu, CPU_UP_PREPARE, &nb, false);
+    if ( err )
         goto fail;
-    }
 
     err = __cpu_up(cpu);
     if ( err < 0 )
         goto fail;
 
-    notifier_rc = notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu, NULL);
-    BUG_ON(notifier_rc != NOTIFY_DONE);
+    cpu_notifier_call_chain(cpu, CPU_ONLINE, NULL, true);
 
     send_global_virq(VIRQ_PCPU_STATE);
 
@@ -164,18 +164,14 @@ int cpu_up(unsigned int cpu)
     return 0;
 
  fail:
-    notifier_rc = notifier_call_chain(&cpu_chain, CPU_UP_CANCELED, hcpu, &nb);
-    BUG_ON(notifier_rc != NOTIFY_DONE);
+    cpu_notifier_call_chain(cpu, CPU_UP_CANCELED, &nb, true);
     cpu_hotplug_done();
     return err;
 }
 
 void notify_cpu_starting(unsigned int cpu)
 {
-    void *hcpu = (void *)(long)cpu;
-    int notifier_rc = notifier_call_chain(
-        &cpu_chain, CPU_STARTING, hcpu, NULL);
-    BUG_ON(notifier_rc != NOTIFY_DONE);
+    cpu_notifier_call_chain(cpu, CPU_STARTING, NULL, true);
 }
 
 static cpumask_t frozen_cpus;
-- 
2.16.4




* [PATCH v2 3/6] xen: add new cpu notifier action CPU_RESUME_FAILED
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
  2019-03-28 12:06 ` [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier Juergen Gross
  2019-03-28 12:06 ` [PATCH v2 2/6] xen: add helper for calling notifier_call_chain() to common/cpu.c Juergen Gross
@ 2019-03-28 12:06 ` Juergen Gross
  2019-03-28 12:06 ` [PATCH v2 4/6] xen: don't free percpu areas during suspend Juergen Gross
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich

Add a new cpu notifier action CPU_RESUME_FAILED which is called for all
cpus which failed to come up at resume. The calls are made only after
all other cpus are already up, so the handlers know which resources are
available at that point.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
V2:
- added comment in xen/include/xen/cpu.h (Dario Faggioli)
---
 xen/common/cpu.c      |  5 +++++
 xen/include/xen/cpu.h | 29 ++++++++++++++++++-----------
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 8bf69600a6..a6efc5e604 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -218,7 +218,12 @@ void enable_nonboot_cpus(void)
             printk("Error bringing CPU%d up: %d\n", cpu, error);
             BUG_ON(error == -EBUSY);
         }
+        else
+            __cpumask_clear_cpu(cpu, &frozen_cpus);
     }
 
+    for_each_cpu ( cpu, &frozen_cpus )
+        cpu_notifier_call_chain(cpu, CPU_RESUME_FAILED, NULL, true);
+
     cpumask_clear(&frozen_cpus);
 }
diff --git a/xen/include/xen/cpu.h b/xen/include/xen/cpu.h
index 2fe3ec05d8..4638c509e2 100644
--- a/xen/include/xen/cpu.h
+++ b/xen/include/xen/cpu.h
@@ -22,33 +22,40 @@ void register_cpu_notifier(struct notifier_block *nb);
  *  CPU_UP_PREPARE -> CPU_STARTING -> CPU_ONLINE -- successful CPU up
  *  CPU_DOWN_PREPARE -> CPU_DOWN_FAILED          -- failed CPU down
  *  CPU_DOWN_PREPARE -> CPU_DYING -> CPU_DEAD    -- successful CPU down
- * 
+ * In the resume case we additionally have:
+ *  CPU_UP_PREPARE -> CPU_UP_CANCELED -> CPU_RESUME_FAILED -- CPU not resumed
+ * with the CPU_RESUME_FAILED handlers called only after an attempt has been
+ * made to bring all CPUs online again, so they know which CPUs did restart
+ * successfully.
+ *
  * Hence note that only CPU_*_PREPARE handlers are allowed to fail. Also note
  * that once CPU_DYING is delivered, an offline action can no longer fail.
- * 
+ *
  * Notifiers are called highest-priority-first when:
  *  (a) A CPU is coming up; or (b) CPU_DOWN_FAILED
  * Notifiers are called lowest-priority-first when:
  *  (a) A CPU is going down; or (b) CPU_UP_CANCELED
  */
 /* CPU_UP_PREPARE: Preparing to bring CPU online. */
-#define CPU_UP_PREPARE   (0x0001 | NOTIFY_FORWARD)
+#define CPU_UP_PREPARE    (0x0001 | NOTIFY_FORWARD)
 /* CPU_UP_CANCELED: CPU is no longer being brought online. */
-#define CPU_UP_CANCELED  (0x0002 | NOTIFY_REVERSE)
+#define CPU_UP_CANCELED   (0x0002 | NOTIFY_REVERSE)
 /* CPU_STARTING: CPU nearly online. Runs on new CPU, irqs still disabled. */
-#define CPU_STARTING     (0x0003 | NOTIFY_FORWARD)
+#define CPU_STARTING      (0x0003 | NOTIFY_FORWARD)
 /* CPU_ONLINE: CPU is up. */
-#define CPU_ONLINE       (0x0004 | NOTIFY_FORWARD)
+#define CPU_ONLINE        (0x0004 | NOTIFY_FORWARD)
 /* CPU_DOWN_PREPARE: CPU is going down. */
-#define CPU_DOWN_PREPARE (0x0005 | NOTIFY_REVERSE)
+#define CPU_DOWN_PREPARE  (0x0005 | NOTIFY_REVERSE)
 /* CPU_DOWN_FAILED: CPU is no longer going down. */
-#define CPU_DOWN_FAILED  (0x0006 | NOTIFY_FORWARD)
+#define CPU_DOWN_FAILED   (0x0006 | NOTIFY_FORWARD)
 /* CPU_DYING: CPU is nearly dead (in stop_machine context). */
-#define CPU_DYING        (0x0007 | NOTIFY_REVERSE)
+#define CPU_DYING         (0x0007 | NOTIFY_REVERSE)
 /* CPU_DEAD: CPU is dead. */
-#define CPU_DEAD         (0x0008 | NOTIFY_REVERSE)
+#define CPU_DEAD          (0x0008 | NOTIFY_REVERSE)
 /* CPU_REMOVE: CPU was removed. */
-#define CPU_REMOVE       (0x0009 | NOTIFY_REVERSE)
+#define CPU_REMOVE        (0x0009 | NOTIFY_REVERSE)
+/* CPU_RESUME_FAILED: CPU failed to come up in resume, all other CPUs up. */
+#define CPU_RESUME_FAILED (0x000a | NOTIFY_REVERSE)
 
 /* Perform CPU hotplug. May return -EAGAIN. */
 int cpu_down(unsigned int cpu);
-- 
2.16.4




* [PATCH v2 4/6] xen: don't free percpu areas during suspend
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
                   ` (2 preceding siblings ...)
  2019-03-28 12:06 ` [PATCH v2 3/6] xen: add new cpu notifier action CPU_RESUME_FAILED Juergen Gross
@ 2019-03-28 12:06 ` Juergen Gross
  2019-03-28 13:39   ` Jan Beulich
  2019-03-28 12:06 ` [PATCH v2 5/6] xen/cpupool: simplify suspend/resume handling Juergen Gross
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

Instead of freeing percpu areas during suspend and allocating them
again when resuming, keep them. Only free an area in case a cpu didn't
come up again when resuming.

It should be noted that there is a potential change in behaviour, as
the percpu areas are no longer zeroed out during suspend/resume. While
I have checked that the cpu notifier hooks being called can cope with
that, there might be some well-hidden dependency on the previous
behaviour. OTOH a component which doesn't register itself for cpu
down/up notifications but expects to see a zeroed percpu variable after
suspend/resume is kind of broken already. And the opposite case, where
such an unregistered component does not expect a percpu variable to
suddenly become zero due to suspend/resume, is much more probable,
especially as the suspend/resume functionality seems not to be tested
that often.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
 xen/arch/x86/percpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/percpu.c b/xen/arch/x86/percpu.c
index 8be4ebddf4..5ea14b6ec3 100644
--- a/xen/arch/x86/percpu.c
+++ b/xen/arch/x86/percpu.c
@@ -76,7 +76,8 @@ static int cpu_percpu_callback(
         break;
     case CPU_UP_CANCELED:
     case CPU_DEAD:
-        if ( !park_offline_cpus )
+    case CPU_RESUME_FAILED:
+        if ( !park_offline_cpus && system_state != SYS_STATE_suspend )
             free_percpu_area(cpu);
         break;
     case CPU_REMOVE:
-- 
2.16.4




* [PATCH v2 5/6] xen/cpupool: simplify suspend/resume handling
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
                   ` (3 preceding siblings ...)
  2019-03-28 12:06 ` [PATCH v2 4/6] xen: don't free percpu areas during suspend Juergen Gross
@ 2019-03-28 12:06 ` Juergen Gross
  2019-03-28 12:06 ` [PATCH v2 6/6] xen/sched: don't disable scheduler on cpus during suspend Juergen Gross
  2019-03-28 13:01 ` [PATCH v2 0/6] xen: simplify suspend/resume handling Volodymyr Babchuk
  6 siblings, 0 replies; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

Instead of temporarily removing cpus from their cpupools during
suspend/resume, only permanently remove those cpus which didn't come
up again when resuming.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V2:
- add comment (George Dunlap)
---
 xen/common/cpupool.c       | 131 ++++++++++++++++++---------------------------
 xen/include/xen/sched-if.h |   1 -
 2 files changed, 52 insertions(+), 80 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index e89bb67e71..31ac323e40 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -47,12 +47,6 @@ static struct cpupool *alloc_cpupool_struct(void)
         xfree(c);
         c = NULL;
     }
-    else if ( !zalloc_cpumask_var(&c->cpu_suspended) )
-    {
-        free_cpumask_var(c->cpu_valid);
-        xfree(c);
-        c = NULL;
-    }
 
     return c;
 }
@@ -60,10 +54,7 @@ static struct cpupool *alloc_cpupool_struct(void)
 static void free_cpupool_struct(struct cpupool *c)
 {
     if ( c )
-    {
-        free_cpumask_var(c->cpu_suspended);
         free_cpumask_var(c->cpu_valid);
-    }
     xfree(c);
 }
 
@@ -477,10 +468,6 @@ void cpupool_rm_domain(struct domain *d)
 /*
  * Called to add a cpu to a pool. CPUs being hot-plugged are added to pool0,
  * as they must have been in there when unplugged.
- *
- * If, on the other hand, we are adding CPUs because we are resuming (e.g.,
- * after ACPI S3) we put the cpu back in the pool where it was in prior when
- * we suspended.
  */
 static int cpupool_cpu_add(unsigned int cpu)
 {
@@ -490,42 +477,15 @@ static int cpupool_cpu_add(unsigned int cpu)
     cpumask_clear_cpu(cpu, &cpupool_locked_cpus);
     cpumask_set_cpu(cpu, &cpupool_free_cpus);
 
-    if ( system_state == SYS_STATE_suspend || system_state == SYS_STATE_resume )
-    {
-        struct cpupool **c;
-
-        for_each_cpupool(c)
-        {
-            if ( cpumask_test_cpu(cpu, (*c)->cpu_suspended ) )
-            {
-                ret = cpupool_assign_cpu_locked(*c, cpu);
-                if ( ret )
-                    goto out;
-                cpumask_clear_cpu(cpu, (*c)->cpu_suspended);
-                break;
-            }
-        }
+    /*
+     * If we are not resuming, we are hot-plugging a cpu, in which case
+     * we add it to pool0, as it certainly was there when hot-unplugged
+     * (or unplugging would have failed) and that is the default behavior
+     * anyway.
+     */
+    per_cpu(cpupool, cpu) = NULL;
+    ret = cpupool_assign_cpu_locked(cpupool0, cpu);
 
-        /*
-         * Either cpu has been found as suspended in a pool, and added back
-         * there, or it stayed free (if it did not belong to any pool when
-         * suspending), and we don't want to do anything.
-         */
-        ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus) ||
-               cpumask_test_cpu(cpu, (*c)->cpu_valid));
-    }
-    else
-    {
-        /*
-         * If we are not resuming, we are hot-plugging cpu, and in which case
-         * we add it to pool0, as it certainly was there when hot-unplagged
-         * (or unplugging would have failed) and that is the default behavior
-         * anyway.
-         */
-        per_cpu(cpupool, cpu) = NULL;
-        ret = cpupool_assign_cpu_locked(cpupool0, cpu);
-    }
- out:
     spin_unlock(&cpupool_lock);
 
     return ret;
@@ -535,42 +495,14 @@ static int cpupool_cpu_add(unsigned int cpu)
  * Called to remove a CPU from a pool. The CPU is locked, to forbid removing
  * it from pool0. In fact, if we want to hot-unplug a CPU, it must belong to
  * pool0, or we fail.
- *
- * However, if we are suspending (e.g., to ACPI S3), we mark the CPU in such
- * a way that it can be put back in its pool when resuming.
  */
 static int cpupool_cpu_remove(unsigned int cpu)
 {
     int ret = -ENODEV;
 
     spin_lock(&cpupool_lock);
-    if ( system_state == SYS_STATE_suspend )
-    {
-        struct cpupool **c;
-
-        for_each_cpupool(c)
-        {
-            if ( cpumask_test_cpu(cpu, (*c)->cpu_valid ) )
-            {
-                cpumask_set_cpu(cpu, (*c)->cpu_suspended);
-                cpumask_clear_cpu(cpu, (*c)->cpu_valid);
-                break;
-            }
-        }
 
-        /*
-         * Either we found cpu in a pool, or it must be free (if it has been
-         * hot-unplagged, then we must have found it in pool0). It is, of
-         * course, fine to suspend or shutdown with CPUs not assigned to a
-         * pool, and (in case of suspend) they will stay free when resuming.
-         */
-        ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus) ||
-               cpumask_test_cpu(cpu, (*c)->cpu_suspended));
-        ASSERT(cpumask_test_cpu(cpu, &cpu_online_map) ||
-               cpumask_test_cpu(cpu, cpupool0->cpu_suspended));
-        ret = 0;
-    }
-    else if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
+    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
     {
         /*
          * If we are not suspending, we are hot-unplugging cpu, and that is
@@ -587,6 +519,41 @@ static int cpupool_cpu_remove(unsigned int cpu)
     return ret;
 }
 
+/*
+ * Called during resume for all cpus which didn't come up again. The cpu must
+ * be removed from the cpupool it is assigned to. In case a cpupool would be
+ * left without any cpu, we move all domains of that cpupool to cpupool0.
+ */
+static void cpupool_cpu_remove_forced(unsigned int cpu)
+{
+    struct cpupool **c;
+    struct domain *d;
+
+    spin_lock(&cpupool_lock);
+
+    if ( cpumask_test_cpu(cpu, &cpupool_free_cpus) )
+        cpumask_clear_cpu(cpu, &cpupool_free_cpus);
+    else
+    {
+        for_each_cpupool(c)
+        {
+            if ( cpumask_test_cpu(cpu, (*c)->cpu_valid) )
+            {
+                cpumask_clear_cpu(cpu, (*c)->cpu_valid);
+                if ( cpumask_weight((*c)->cpu_valid) == 0 )
+                {
+                    if ( *c == cpupool0 )
+                        panic("No cpu left in cpupool0\n");
+                    for_each_domain_in_cpupool(d, *c)
+                        cpupool_move_domain_locked(d, cpupool0);
+                }
+            }
+        }
+    }
+
+    spin_unlock(&cpupool_lock);
+}
+
 /*
  * do cpupool related sysctl operations
  */
@@ -774,10 +741,16 @@ static int cpu_callback(
     {
     case CPU_DOWN_FAILED:
     case CPU_ONLINE:
-        rc = cpupool_cpu_add(cpu);
+        if ( system_state <= SYS_STATE_active )
+            rc = cpupool_cpu_add(cpu);
         break;
     case CPU_DOWN_PREPARE:
-        rc = cpupool_cpu_remove(cpu);
+        /* Suspend/Resume don't change assignments of cpus to cpupools. */
+        if ( system_state <= SYS_STATE_active )
+            rc = cpupool_cpu_remove(cpu);
+        break;
+    case CPU_RESUME_FAILED:
+        cpupool_cpu_remove_forced(cpu);
         break;
     default:
         break;
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 9596eae1e2..92bc7a0365 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -214,7 +214,6 @@ struct cpupool
 {
     int              cpupool_id;
     cpumask_var_t    cpu_valid;      /* all cpus assigned to pool */
-    cpumask_var_t    cpu_suspended;  /* cpus in S3 that should be in this pool */
     struct cpupool   *next;
     unsigned int     n_dom;
     struct scheduler *sched;
-- 
2.16.4




* [PATCH v2 6/6] xen/sched: don't disable scheduler on cpus during suspend
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
                   ` (4 preceding siblings ...)
  2019-03-28 12:06 ` [PATCH v2 5/6] xen/cpupool: simplify suspend/resume handling Juergen Gross
@ 2019-03-28 12:06 ` Juergen Gross
  2019-03-29 17:36   ` Dario Faggioli
  2019-03-28 13:01 ` [PATCH v2 0/6] xen: simplify suspend/resume handling Volodymyr Babchuk
  6 siblings, 1 reply; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 12:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, George Dunlap, Dario Faggioli

Today there is special handling in cpu_disable_scheduler() for suspend,
forcing all vcpus to the boot cpu. In fact there is no need for that,
as during resume the vcpus are put on the correct cpus again anyway.

So we can just omit calling cpu_disable_scheduler() when offlining a
cpu due to suspend, and on resume we can omit taking the schedule lock
when selecting the new processor.

In restore_vcpu_affinity() we should be careful when applying affinity
as the cpu might not have come back to life. This in turn enables us
to even support affinity_broken across suspend/resume.

Avoid the remaining scheduler dealloc/alloc dance when doing suspend
and resume, too. It is enough to react to cpus failing to come up again
on resume.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/schedule.c | 161 ++++++++++++++++----------------------------------
 1 file changed, 52 insertions(+), 109 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 5d2bbd5198..6b5d454630 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -560,33 +560,6 @@ static void vcpu_move_locked(struct vcpu *v, unsigned int new_cpu)
         v->processor = new_cpu;
 }
 
-/*
- * Move a vcpu from its current processor to a target new processor,
- * without asking the scheduler to do any placement. This is intended
- * for being called from special contexts, where things are quiet
- * enough that no contention is supposed to happen (i.e., during
- * shutdown or software suspend, like ACPI S3).
- */
-static void vcpu_move_nosched(struct vcpu *v, unsigned int new_cpu)
-{
-    unsigned long flags;
-    spinlock_t *lock, *new_lock;
-
-    ASSERT(system_state == SYS_STATE_suspend);
-    ASSERT(!vcpu_runnable(v) && (atomic_read(&v->pause_count) ||
-                                 atomic_read(&v->domain->pause_count)));
-
-    lock = per_cpu(schedule_data, v->processor).schedule_lock;
-    new_lock = per_cpu(schedule_data, new_cpu).schedule_lock;
-
-    sched_spin_lock_double(lock, new_lock, &flags);
-    ASSERT(new_cpu != v->processor);
-    vcpu_move_locked(v, new_cpu);
-    sched_spin_unlock_double(lock, new_lock, flags);
-
-    sched_move_irqs(v);
-}
-
 /*
  * Initiating migration
  *
@@ -735,31 +708,36 @@ void restore_vcpu_affinity(struct domain *d)
 
         ASSERT(!vcpu_runnable(v));
 
-        lock = vcpu_schedule_lock_irq(v);
-
-        if ( v->affinity_broken )
-        {
-            sched_set_affinity(v, v->cpu_hard_affinity_saved, NULL);
-            v->affinity_broken = 0;
-
-        }
-
         /*
-         * During suspend (in cpu_disable_scheduler()), we moved every vCPU
-         * to BSP (which, as of now, is pCPU 0), as a temporary measure to
-         * allow the nonboot processors to have their data structure freed
-         * and go to sleep. But nothing guardantees that the BSP is a valid
-         * pCPU for a particular domain.
+         * Re-assign the initial processor, as after resume we have no
+         * guarantee the old processor has come back to life again.
          *
          * Therefore, here, before actually unpausing the domains, we should
          * set v->processor of each of their vCPUs to something that will
          * make sense for the scheduler of the cpupool in which they are in.
          */
         cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
-                    cpupool_domain_cpumask(v->domain));
-        v->processor = cpumask_any(cpumask_scratch_cpu(cpu));
+                    cpupool_domain_cpumask(d));
+        if ( cpumask_empty(cpumask_scratch_cpu(cpu)) )
+        {
+            if ( v->affinity_broken )
+            {
+                sched_set_affinity(v, v->cpu_hard_affinity_saved, NULL);
+                v->affinity_broken = 0;
+                cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
+                            cpupool_domain_cpumask(d));
+            }
 
-        spin_unlock_irq(lock);
+            if ( cpumask_empty(cpumask_scratch_cpu(cpu)) )
+            {
+                printk(XENLOG_DEBUG "Breaking affinity for %pv\n", v);
+                sched_set_affinity(v, &cpumask_all, NULL);
+                cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
+                            cpupool_domain_cpumask(d));
+            }
+        }
+
+        v->processor = cpumask_any(cpumask_scratch_cpu(cpu));
 
         lock = vcpu_schedule_lock_irq(v);
         v->processor = SCHED_OP(vcpu_scheduler(v), pick_cpu, v);
@@ -783,7 +761,6 @@ int cpu_disable_scheduler(unsigned int cpu)
     struct vcpu *v;
     struct cpupool *c;
     cpumask_t online_affinity;
-    unsigned int new_cpu;
     int ret = 0;
 
     c = per_cpu(cpupool, cpu);
@@ -809,14 +786,7 @@ int cpu_disable_scheduler(unsigned int cpu)
                     break;
                 }
 
-                if (system_state == SYS_STATE_suspend)
-                {
-                    cpumask_copy(v->cpu_hard_affinity_saved,
-                                 v->cpu_hard_affinity);
-                    v->affinity_broken = 1;
-                }
-                else
-                    printk(XENLOG_DEBUG "Breaking affinity for %pv\n", v);
+                printk(XENLOG_DEBUG "Breaking affinity for %pv\n", v);
 
                 sched_set_affinity(v, &cpumask_all, NULL);
             }
@@ -828,60 +798,26 @@ int cpu_disable_scheduler(unsigned int cpu)
                 continue;
             }
 
-            /* If it is on this cpu, we must send it away. */
-            if ( unlikely(system_state == SYS_STATE_suspend) )
-            {
-                vcpu_schedule_unlock_irqrestore(lock, flags, v);
-
-                /*
-                 * If we are doing a shutdown/suspend, it is not necessary to
-                 * ask the scheduler to chime in. In fact:
-                 *  * there is no reason for it: the end result we are after
-                 *    is just 'all the vcpus on the boot pcpu, and no vcpu
-                 *    anywhere else', so let's just go for it;
-                 *  * it's wrong, for cpupools with only non-boot pcpus, as
-                 *    the scheduler would always fail to send the vcpus away
-                 *    from the last online (non boot) pcpu!
-                 *
-                 * Therefore, in the shutdown/suspend case, we just pick up
-                 * one (still) online pcpu. Note that, at this stage, all
-                 * domains (including dom0) have been paused already, so we
-                 * do not expect any vcpu activity at all.
-                 */
-                cpumask_andnot(&online_affinity, &cpu_online_map,
-                               cpumask_of(cpu));
-                BUG_ON(cpumask_empty(&online_affinity));
-                /*
-                 * As boot cpu is, usually, pcpu #0, using cpumask_first()
-                 * will make us converge quicker.
-                 */
-                new_cpu = cpumask_first(&online_affinity);
-                vcpu_move_nosched(v, new_cpu);
-            }
-            else
-            {
-                /*
-                 * OTOH, if the system is still live, and we are here because
-                 * we are doing some cpupool manipulations:
-                 *  * we want to call the scheduler, and let it re-evaluation
-                 *    the placement of the vcpu, taking into account the new
-                 *    cpupool configuration;
-                 *  * the scheduler will always fine a suitable solution, or
-                 *    things would have failed before getting in here.
-                 */
-                vcpu_migrate_start(v);
-                vcpu_schedule_unlock_irqrestore(lock, flags, v);
+            /* If it is on this cpu, we must send it away.
+             * We are doing some cpupool manipulations:
+             *  * we want to call the scheduler, and let it re-evaluate
+             *    the placement of the vcpu, taking into account the new
+             *    cpupool configuration;
+             *  * the scheduler will always find a suitable solution, or
+             *    things would have failed before getting in here.
+             */
+            vcpu_migrate_start(v);
+            vcpu_schedule_unlock_irqrestore(lock, flags, v);
 
-                vcpu_migrate_finish(v);
+            vcpu_migrate_finish(v);
 
-                /*
-                 * The only caveat, in this case, is that if a vcpu active in
-                 * the hypervisor isn't migratable. In this case, the caller
-                 * should try again after releasing and reaquiring all locks.
-                 */
-                if ( v->processor == cpu )
-                    ret = -EAGAIN;
-            }
+            /*
+             * The only caveat, in this case, is a vcpu active in the
+             * hypervisor that isn't migratable. In this case, the caller
+             * should try again after releasing and reacquiring all locks.
+             */
+            if ( v->processor == cpu )
+                ret = -EAGAIN;
         }
     }
 
@@ -1751,26 +1687,33 @@ static int cpu_schedule_callback(
     switch ( action )
     {
     case CPU_STARTING:
-        SCHED_OP(sched, init_pdata, sd->sched_priv, cpu);
+        if ( system_state != SYS_STATE_resume )
+            SCHED_OP(sched, init_pdata, sd->sched_priv, cpu);
         break;
     case CPU_UP_PREPARE:
-        rc = cpu_schedule_up(cpu);
+        if ( system_state != SYS_STATE_resume )
+            rc = cpu_schedule_up(cpu);
         break;
     case CPU_DOWN_PREPARE:
         rcu_read_lock(&domlist_read_lock);
         rc = cpu_disable_scheduler_check(cpu);
         rcu_read_unlock(&domlist_read_lock);
         break;
+    case CPU_RESUME_FAILED:
     case CPU_DEAD:
+        if ( system_state == SYS_STATE_suspend )
+            break;
         rcu_read_lock(&domlist_read_lock);
         rc = cpu_disable_scheduler(cpu);
         BUG_ON(rc);
         rcu_read_unlock(&domlist_read_lock);
         SCHED_OP(sched, deinit_pdata, sd->sched_priv, cpu);
-        /* Fallthrough */
-    case CPU_UP_CANCELED:
         cpu_schedule_down(cpu);
         break;
+    case CPU_UP_CANCELED:
+        if ( system_state != SYS_STATE_resume )
+            cpu_schedule_down(cpu);
+        break;
     default:
         break;
     }
-- 
2.16.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
                   ` (5 preceding siblings ...)
  2019-03-28 12:06 ` [PATCH v2 6/6] xen/sched: don't disable scheduler on cpus during suspend Juergen Gross
@ 2019-03-28 13:01 ` Volodymyr Babchuk
  2019-03-28 13:21   ` Juergen Gross
  2019-03-28 13:33   ` Julien Grall
  6 siblings, 2 replies; 23+ messages in thread
From: Volodymyr Babchuk @ 2019-03-28 13:01 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Julien Grall, Jan Beulich, xen-devel, Ian Jackson,
	Roger Pau Monné

Hello Juergen,

On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
>
> Especially in the scheduler area (schedule.c, cpupool.c) there is a
> rather complex handling involved when doing suspend and resume.
>
> This can be simplified a lot by not performing a complete cpu down and
> up cycle for the non-boot cpus, but keeping the pure software related
> state and freeing it only in case a cpu didn't come up again during
> resume.
>
> In summary not only the complexity can be reduced, but the failure
> tolerance will be even better with this series: With a dedicated hook
> for failing cpus when resuming it is now possible to survive e.g. a
> cpupool being left without any cpu after resume by moving its domains
> to cpupool0.
>
> Juergen Gross (6):
>   xen/sched: call cpu_disable_scheduler() via cpu notifier
>   xen: add helper for calling notifier_call_chain() to common/cpu.c
>   xen: add new cpu notifier action CPU_RESUME_FAILED
>   xen: don't free percpu areas during suspend
>   xen/cpupool: simplify suspend/resume handling
>   xen/sched: don't disable scheduler on cpus during suspend
>
>  xen/arch/arm/smpboot.c     |   4 -
>  xen/arch/x86/percpu.c      |   3 +-
>  xen/arch/x86/smpboot.c     |   3 -
>  xen/common/cpu.c           |  61 +++++++-------
>  xen/common/cpupool.c       | 131 ++++++++++++-----------------
>  xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
>  xen/include/xen/cpu.h      |  29 ++++---
>  xen/include/xen/sched-if.h |   1 -
>  8 files changed, 190 insertions(+), 245 deletions(-)
>

I tested your patch series on an ARM64 platform. We had an issue with
hard affinity - there was an assertion failure in the sched_credit2 code
during suspend if one of the vCPUs was pinned to a non-0 pCPU.

Your patch series seems to fix the issue during suspend. But now I'm
seeing a crash during resume:


(XEN) suspend.c:198: Resume
(XEN) Enabling non-boot CPUs  ...
(XEN) Bringing up CPU1
(XEN) CPU1 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
(XEN) CPU 1 booted.
(XEN) Bringing up CPU2
(XEN) Data Abort Trap. Syndrome=0x6
(XEN) Walking Hypervisor VA 0x0 on CPU1 via TTBR 0x00000000781a8000
(XEN) 0TH[0x0] = 0x00000000781b0f7f
(XEN) 1ST[0x0] = 0x00000000781aaf7f
(XEN) 2ND[0x0] = 0x0000000000000000
(XEN) CPU1: Unexpected Trap: Data Abort
(XEN) ----[ Xen-4.13-unstable  arm64  debug=y   Not tainted ]----
(XEN) CPU:    1
(XEN) PC:     0000000000233660 _spin_lock+0x1c/0x88
(XEN) LR:     000000000023365c
(XEN) SP:     000080037ffc7d50
(XEN) CPSR:   600002c9 MODE:64-bit EL2h (Hypervisor, handler)
(XEN)      X0: 0000000000000006  X1: 0000000000000000  X2: 0000000000000000
(XEN)      X3: 0000000000000002  X4: 000080037fc94480  X5: 0000000000000000
(XEN)      X6: 0000000000000080  X7: 000080037ffb0000  X8: 00000000002a1000
(XEN)      X9: 000000000000000a X10: 000080037ffc7bf8 X11: 0000000000000031
(XEN)     X12: 0000000000000001 X13: 000000000027fff0 X14: 0000000000000020
(XEN)     X15: 0000000000000000 X16: 0000000000000000 X17: 0000000000000000
(XEN)     X18: 0000000000000000 X19: 0000000000000000 X20: 0000000000000000
(XEN)     X21: 000080037ffd0108 X22: 0000000000000001 X23: 000000000033bc88
(XEN)     X24: 0000000000336020 X25: 0000000000000000 X26: 0000000000000001
(XEN)     X27: 0000000000336000 X28: 0000000000000000  FP: 000080037ffc7d50
(XEN)
(XEN)   VTCR_EL2: 80023558
(XEN)  VTTBR_EL2: 0000000000000000
(XEN)
(XEN)  SCTLR_EL2: 30cd183d
(XEN)    HCR_EL2: 0000000000000038
(XEN)  TTBR0_EL2: 00000000781a8000
(XEN)
(XEN)    ESR_EL2: 96000006
(XEN)  HPFAR_EL2: 0000000000000000
(XEN)    FAR_EL2: 0000000000000000
(XEN)
(XEN) Xen stack trace from sp=000080037ffc7d50:
(XEN)    000080037ffc7d70 00000000002336e8 000080037ffd2000 000000000023e00c
(XEN)    000080037ffc7d80 000000000022e90c 000080037ffc7e10 0000000000232af8
(XEN)    0000000000000001 00000000002fbb00 ffffffffffffffff 000000000033cf20
(XEN)    00000000002a0680 0000000000000001 0000000000000001 0000000000000001
(XEN)    0000000000000000 000080037ffc7e90 000080037ffc7e50 00000000ffffffc8
(XEN)    000000000029f008 00000000002ffc41 000080037ffc7e90 0000000000263c68
(XEN)    000080037ffc7e50 0000000000232b6c 0000000000000001 0000000000000002
(XEN)    0000000000000001 00000000002fbb80 0000000000336448 00000000002fbb00
(XEN)    000080037ffc7e60 0000000000257230 000080037ffc7e90 0000000000263c6c
(XEN)    0000000000000001 0000000077e80000 0000000000000000 0000000000000001
(XEN)    0000000000000000 0000000000000001 0000000000000001 0000ffff0000ffff
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<0000000000233660>] _spin_lock+0x1c/0x88 (PC)
(XEN)    [<000000000023365c>] _spin_lock+0x18/0x88 (LR)
(XEN)    [<00000000002336e8>] _spin_lock_irq+0x1c/0x24
(XEN)    [<000000000022e90c>] schedule.c#schedule+0xe8/0x74c
(XEN)    [<0000000000232af8>] softirq.c#__do_softirq+0xcc/0xe4
(XEN)    [<0000000000232b6c>] do_softirq+0x14/0x1c
(XEN)    [<0000000000257230>] idle_loop+0x174/0x188
(XEN)    [<0000000000263c6c>] start_secondary+0x1f4/0x200
(XEN)    [<0000000000000001>] 0000000000000001
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) CPU1: Unexpected Trap: Data Abort
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) PSCI cpu off failed for CPU0 err=-3
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) CPU2 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
(XEN) CPU 2 booted.
(XEN) Data Abort Trap. Syndrome=0x6
(XEN) Walking Hypervisor VA 0x0 on CPU2 via TTBR 0x00000000781a8000
(XEN) 0TH[0x0] = 0x00000000781b0f7f
(XEN) 1ST[0x0] = 0x00000000781aaf7f
(XEN) 2ND[0x0] = 0x0000000000000000
(XEN) CPU2: Unexpected Trap: Data Abort
(XEN) ----[ Xen-4.13-unstable  arm64  debug=y   Not tainted ]----
(XEN) CPU:    2
(XEN) PC:     0000000000233660 _spin_lock+0x1c/0x88
(XEN) LR:     000000000023365c
(XEN) SP:     000080037ff77d50
(XEN) CPSR:   a00002c9 MODE:64-bit EL2h (Hypervisor, handler)
(XEN)      X0: 0000000000000006  X1: 00000000fffffffe  X2: 0000000000000000
(XEN)      X3: 0000000000000002  X4: 000080037fc42480  X5: 0000000000000000
(XEN)      X6: 0000000000000080  X7: 000080037ffb0000  X8: 00000000002a1000
(XEN)      X9: 000000000000000a X10: 000080037ff77bf8 X11: 0000000000000032
(XEN)     X12: 0000000000000001 X13: 000000000027fff0 X14: 0000000000000020
(XEN)     X15: 0000000000000000 X16: 0000000000000000 X17: 0000000000000000
(XEN)     X18: 0000000000000000 X19: 0000000000000000 X20: 0000000000000000
(XEN)     X21: 000080037ff7e108 X22: 0000000000000002 X23: 000000000033bc88
(XEN)     X24: 0000000000336020 X25: 0000000000000000 X26: 0000000000000002
(XEN)     X27: 0000000000336000 X28: 0000000000000000  FP: 000080037ff77d50
(XEN)
(XEN)   VTCR_EL2: 80023558
(XEN)  VTTBR_EL2: 0000000000000000
(XEN)
(XEN)  SCTLR_EL2: 30cd183d
(XEN)    HCR_EL2: 0000000000000038
(XEN)  TTBR0_EL2: 00000000781a8000
(XEN)
(XEN)    ESR_EL2: 96000006
(XEN)  HPFAR_EL2: 0000000000000000
(XEN)    FAR_EL2: 0000000000000000
(XEN)
(XEN) Xen stack trace from sp=000080037ff77d50:
(XEN)    000080037ff77d70 00000000002336e8 000080037ff7d000 000000000023e00c
(XEN)    000080037ff77d80 000000000022e90c 000080037ff77e10 0000000000232af8
(XEN)    0000000000000002 00000000002fbb00 ffffffffffffffff 000000000033cf20
(XEN)    00000000002a0680 0000000000000001 0000000000000001 0000000000000001
(XEN)    0000000000000000 000080037ff77e90 000080037ff77e50 00000000ffffffc8
(XEN)    000000000029f008 00000000002ffc41 000080037ff77e90 0000000000263c68
(XEN)    000080037ff77e50 0000000000232b6c 0000000000000002 0000000000000004
(XEN)    0000000000000002 00000000002fbc00 0000000000336448 00000000002fbb00
(XEN)    000080037ff77e60 0000000000257230 000080037ff77e90 0000000000263c6c
(XEN)    0000000000000002 0000000077e80000 0000000000000000 0000000000000001
(XEN)    0000000000000000 0000000000000002 0000000000000001 effffffffffaffff
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<0000000000233660>] _spin_lock+0x1c/0x88 (PC)
(XEN)    [<000000000023365c>] _spin_lock+0x18/0x88 (LR)
(XEN)    [<00000000002336e8>] _spin_lock_irq+0x1c/0x24
(XEN)    [<000000000022e90c>] schedule.c#schedule+0xe8/0x74c
(XEN)    [<0000000000232af8>] softirq.c#__do_softirq+0xcc/0xe4
(XEN)    [<0000000000232b6c>] do_softirq+0x14/0x1c
(XEN)    [<0000000000257230>] idle_loop+0x174/0x188
(XEN)    [<0000000000263c6c>] start_secondary+0x1f4/0x200
(XEN)    [<0000000000000002>] 0000000000000002
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) CPU2: Unexpected Trap: Data Abort
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...





-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:01 ` [PATCH v2 0/6] xen: simplify suspend/resume handling Volodymyr Babchuk
@ 2019-03-28 13:21   ` Juergen Gross
  2019-03-28 13:23     ` Julien Grall
  2019-03-28 13:34     ` Volodymyr Babchuk
  2019-03-28 13:33   ` Julien Grall
  1 sibling, 2 replies; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 13:21 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Julien Grall, Jan Beulich, xen-devel, Ian Jackson,
	Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 2142 bytes --]

On 28/03/2019 14:01, Volodymyr Babchuk wrote:
> Hello Juergen,
> 
> On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
>>
>> Especially in the scheduler area (schedule.c, cpupool.c) there is a
>> rather complex handling involved when doing suspend and resume.
>>
>> This can be simplified a lot by not performing a complete cpu down and
>> up cycle for the non-boot cpus, but keeping the pure software related
>> state and freeing it only in case a cpu didn't come up again during
>> resume.
>>
>> In summary not only the complexity can be reduced, but the failure
>> tolerance will be even better with this series: With a dedicated hook
>> for failing cpus when resuming it is now possible to survive e.g. a
>> cpupool being left without any cpu after resume by moving its domains
>> to cpupool0.
>>
>> Juergen Gross (6):
>>   xen/sched: call cpu_disable_scheduler() via cpu notifier
>>   xen: add helper for calling notifier_call_chain() to common/cpu.c
>>   xen: add new cpu notifier action CPU_RESUME_FAILED
>>   xen: don't free percpu areas during suspend
>>   xen/cpupool: simplify suspend/resume handling
>>   xen/sched: don't disable scheduler on cpus during suspend
>>
>>  xen/arch/arm/smpboot.c     |   4 -
>>  xen/arch/x86/percpu.c      |   3 +-
>>  xen/arch/x86/smpboot.c     |   3 -
>>  xen/common/cpu.c           |  61 +++++++-------
>>  xen/common/cpupool.c       | 131 ++++++++++++-----------------
>>  xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
>>  xen/include/xen/cpu.h      |  29 ++++---
>>  xen/include/xen/sched-if.h |   1 -
>>  8 files changed, 190 insertions(+), 245 deletions(-)
>>
> 
> I tested your patch series on ARM64 platform. We had issue with hard
> affinity - there was assertion failure in sched_credit2 code during
> suspension if one of the vCPUs is pinned to non-0 pCPU.
> 
> Seems, your patch series fixes the issue during suspend. But now I'm
> seeing crash during resume:

Oh, I wasn't aware that suspend/resume is possible on ARM, too.

I need to modify arch/arm/percpu.c like on x86.

Could you test whether the attached patch works?


Juergen

[-- Attachment #2: arm.patch --]
[-- Type: text/x-patch, Size: 642 bytes --]

diff --git a/xen/arch/arm/percpu.c b/xen/arch/arm/percpu.c
index 25442c48fe..597027c6c9 100644
--- a/xen/arch/arm/percpu.c
+++ b/xen/arch/arm/percpu.c
@@ -58,11 +58,14 @@ static int cpu_percpu_callback(
     switch ( action )
     {
     case CPU_UP_PREPARE:
-        rc = init_percpu_area(cpu);
+        if ( system_state != SYS_STATE_resume )
+            rc = init_percpu_area(cpu);
         break;
     case CPU_UP_CANCELED:
     case CPU_DEAD:
-        free_percpu_area(cpu);
+    case CPU_RESUME_FAILED:
+        if ( system_state != SYS_STATE_suspend )
+            free_percpu_area(cpu);
         break;
     default:
         break;

[-- Attachment #3: Type: text/plain, Size: 157 bytes --]


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:21   ` Juergen Gross
@ 2019-03-28 13:23     ` Julien Grall
  2019-03-28 13:34     ` Volodymyr Babchuk
  1 sibling, 0 replies; 23+ messages in thread
From: Julien Grall @ 2019-03-28 13:23 UTC (permalink / raw)
  To: Juergen Gross, Volodymyr Babchuk
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Jan Beulich, xen-devel, Ian Jackson, Roger Pau Monné



On 3/28/19 1:21 PM, Juergen Gross wrote:
> On 28/03/2019 14:01, Volodymyr Babchuk wrote:
>> Hello Juergen,
>>
>> On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
>>>
>>> Especially in the scheduler area (schedule.c, cpupool.c) there is a
>>> rather complex handling involved when doing suspend and resume.
>>>
>>> This can be simplified a lot by not performing a complete cpu down and
>>> up cycle for the non-boot cpus, but keeping the pure software related
>>> state and freeing it only in case a cpu didn't come up again during
>>> resume.
>>>
>>> In summary not only the complexity can be reduced, but the failure
>>> tolerance will be even better with this series: With a dedicated hook
>>> for failing cpus when resuming it is now possible to survive e.g. a
>>> cpupool being left without any cpu after resume by moving its domains
>>> to cpupool0.
>>>
>>> Juergen Gross (6):
>>>    xen/sched: call cpu_disable_scheduler() via cpu notifier
>>>    xen: add helper for calling notifier_call_chain() to common/cpu.c
>>>    xen: add new cpu notifier action CPU_RESUME_FAILED
>>>    xen: don't free percpu areas during suspend
>>>    xen/cpupool: simplify suspend/resume handling
>>>    xen/sched: don't disable scheduler on cpus during suspend
>>>
>>>   xen/arch/arm/smpboot.c     |   4 -
>>>   xen/arch/x86/percpu.c      |   3 +-
>>>   xen/arch/x86/smpboot.c     |   3 -
>>>   xen/common/cpu.c           |  61 +++++++-------
>>>   xen/common/cpupool.c       | 131 ++++++++++++-----------------
>>>   xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
>>>   xen/include/xen/cpu.h      |  29 ++++---
>>>   xen/include/xen/sched-if.h |   1 -
>>>   8 files changed, 190 insertions(+), 245 deletions(-)
>>>
>>
>> I tested your patch series on ARM64 platform. We had issue with hard
>> affinity - there was assertion failure in sched_credit2 code during
>> suspension if one of the vCPUs is pinned to non-0 pCPU.
>>
>> Seems, your patch series fixes the issue during suspend. But now I'm
>> seeing crash during resume:
> 
> Oh, I wasn't aware the suspend/resume is possible on ARM, too.

We don't have suspend/resume on Arm... There are series on the ML, but I 
would not consider them bug-free (there was only the first posting).

Cheers,

-- 
Julien Grall


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:01 ` [PATCH v2 0/6] xen: simplify suspend/resume handling Volodymyr Babchuk
  2019-03-28 13:21   ` Juergen Gross
@ 2019-03-28 13:33   ` Julien Grall
  2019-03-28 13:37     ` Juergen Gross
  2019-03-28 13:56     ` Volodymyr Babchuk
  1 sibling, 2 replies; 23+ messages in thread
From: Julien Grall @ 2019-03-28 13:33 UTC (permalink / raw)
  To: Volodymyr Babchuk, Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Jan Beulich, xen-devel, Ian Jackson, Roger Pau Monné

Hi,

On 3/28/19 1:01 PM, Volodymyr Babchuk wrote:
> Hello Juergen,
> 
> On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
>>
>> Especially in the scheduler area (schedule.c, cpupool.c) there is a
>> rather complex handling involved when doing suspend and resume.
>>
>> This can be simplified a lot by not performing a complete cpu down and
>> up cycle for the non-boot cpus, but keeping the pure software related
>> state and freeing it only in case a cpu didn't come up again during
>> resume.
>>
>> In summary not only the complexity can be reduced, but the failure
>> tolerance will be even better with this series: With a dedicated hook
>> for failing cpus when resuming it is now possible to survive e.g. a
>> cpupool being left without any cpu after resume by moving its domains
>> to cpupool0.
>>
>> Juergen Gross (6):
>>    xen/sched: call cpu_disable_scheduler() via cpu notifier
>>    xen: add helper for calling notifier_call_chain() to common/cpu.c
>>    xen: add new cpu notifier action CPU_RESUME_FAILED
>>    xen: don't free percpu areas during suspend
>>    xen/cpupool: simplify suspend/resume handling
>>    xen/sched: don't disable scheduler on cpus during suspend
>>
>>   xen/arch/arm/smpboot.c     |   4 -
>>   xen/arch/x86/percpu.c      |   3 +-
>>   xen/arch/x86/smpboot.c     |   3 -
>>   xen/common/cpu.c           |  61 +++++++-------
>>   xen/common/cpupool.c       | 131 ++++++++++++-----------------
>>   xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
>>   xen/include/xen/cpu.h      |  29 ++++---
>>   xen/include/xen/sched-if.h |   1 -
>>   8 files changed, 190 insertions(+), 245 deletions(-)
>>
> 
> I tested your patch series on ARM64 platform. We had issue with hard
> affinity - there was assertion failure in sched_credit2 code during
> suspension if one of the vCPUs is pinned to non-0 pCPU.
When you report an error, please make clear what commit you are using 
and whether you have patches applied on top.

In this case, we have no support for suspend/resume on Arm today. So a 
bug report around suspend/resume is a bit confusing to have. It is also 
more difficult to help when we don't have the full picture, as the bug 
may be in your code or in upstream Xen.

I saw Juergen suggested a fix; please carry it in whatever series you have.

> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) PSCI cpu off failed for CPU0 err=-3
> (XEN) ****************************************

PSCI CPU off failing is never good news. Here, the command has been 
denied by the PSCI monitor. But... why is CPU off actually called on 
CPU0? Shouldn't we have turned off the platform instead?

> (XEN)
> (XEN) Reboot in five seconds...

Are the logs below actually a mistaken paste?

> (XEN) CPU2 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
> (XEN) CPU 2 booted.
> (XEN) Data Abort Trap. Syndrome=0x6
> (XEN) Walking Hypervisor VA 0x0 on CPU2 via TTBR 0x00000000781a8000
> (XEN) 0TH[0x0] = 0x00000000781b0f7f
> (XEN) 1ST[0x0] = 0x00000000781aaf7f
> (XEN) 2ND[0x0] = 0x0000000000000000
> (XEN) CPU2: Unexpected Trap: Data Abort
> (XEN) ----[ Xen-4.13-unstable  arm64  debug=y   Not tainted ]----
> (XEN) CPU:    2
> (XEN) PC:     0000000000233660 _spin_lock+0x1c/0x88
> (XEN) LR:     000000000023365c
> (XEN) SP:     000080037ff77d50
> (XEN) CPSR:   a00002c9 MODE:64-bit EL2h (Hypervisor, handler)
> (XEN)      X0: 0000000000000006  X1: 00000000fffffffe  X2: 0000000000000000
> (XEN)      X3: 0000000000000002  X4: 000080037fc42480  X5: 0000000000000000
> (XEN)      X6: 0000000000000080  X7: 000080037ffb0000  X8: 00000000002a1000
> (XEN)      X9: 000000000000000a X10: 000080037ff77bf8 X11: 0000000000000032
> (XEN)     X12: 0000000000000001 X13: 000000000027fff0 X14: 0000000000000020
> (XEN)     X15: 0000000000000000 X16: 0000000000000000 X17: 0000000000000000
> (XEN)     X18: 0000000000000000 X19: 0000000000000000 X20: 0000000000000000
> (XEN)     X21: 000080037ff7e108 X22: 0000000000000002 X23: 000000000033bc88
> (XEN)     X24: 0000000000336020 X25: 0000000000000000 X26: 0000000000000002
> (XEN)     X27: 0000000000336000 X28: 0000000000000000  FP: 000080037ff77d50
> (XEN)
> (XEN)   VTCR_EL2: 80023558
> (XEN)  VTTBR_EL2: 0000000000000000
> (XEN)
> (XEN)  SCTLR_EL2: 30cd183d
> (XEN)    HCR_EL2: 0000000000000038
> (XEN)  TTBR0_EL2: 00000000781a8000
> (XEN)
> (XEN)    ESR_EL2: 96000006
> (XEN)  HPFAR_EL2: 0000000000000000
> (XEN)    FAR_EL2: 0000000000000000
> (XEN)
> (XEN) Xen stack trace from sp=000080037ff77d50:
> (XEN)    000080037ff77d70 00000000002336e8 000080037ff7d000 000000000023e00c
> (XEN)    000080037ff77d80 000000000022e90c 000080037ff77e10 0000000000232af8
> (XEN)    0000000000000002 00000000002fbb00 ffffffffffffffff 000000000033cf20
> (XEN)    00000000002a0680 0000000000000001 0000000000000001 0000000000000001
> (XEN)    0000000000000000 000080037ff77e90 000080037ff77e50 00000000ffffffc8
> (XEN)    000000000029f008 00000000002ffc41 000080037ff77e90 0000000000263c68
> (XEN)    000080037ff77e50 0000000000232b6c 0000000000000002 0000000000000004
> (XEN)    0000000000000002 00000000002fbc00 0000000000336448 00000000002fbb00
> (XEN)    000080037ff77e60 0000000000257230 000080037ff77e90 0000000000263c6c
> (XEN)    0000000000000002 0000000077e80000 0000000000000000 0000000000000001
> (XEN)    0000000000000000 0000000000000002 0000000000000001 effffffffffaffff
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<0000000000233660>] _spin_lock+0x1c/0x88 (PC)
> (XEN)    [<000000000023365c>] _spin_lock+0x18/0x88 (LR)
> (XEN)    [<00000000002336e8>] _spin_lock_irq+0x1c/0x24
> (XEN)    [<000000000022e90c>] schedule.c#schedule+0xe8/0x74c
> (XEN)    [<0000000000232af8>] softirq.c#__do_softirq+0xcc/0xe4
> (XEN)    [<0000000000232b6c>] do_softirq+0x14/0x1c
> (XEN)    [<0000000000257230>] idle_loop+0x174/0x188
> (XEN)    [<0000000000263c6c>] start_secondary+0x1f4/0x200
> (XEN)    [<0000000000000002>] 0000000000000002
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) CPU2: Unexpected Trap: Data Abort
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:21   ` Juergen Gross
  2019-03-28 13:23     ` Julien Grall
@ 2019-03-28 13:34     ` Volodymyr Babchuk
  1 sibling, 0 replies; 23+ messages in thread
From: Volodymyr Babchuk @ 2019-03-28 13:34 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Julien Grall, Jan Beulich, xen-devel, Ian Jackson,
	Roger Pau Monné

Hi,

On Thu, 28 Mar 2019 at 15:21, Juergen Gross <jgross@suse.com> wrote:
>
> On 28/03/2019 14:01, Volodymyr Babchuk wrote:
> > Hello Juergen,
> >
> > On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
> >>
> >> Especially in the scheduler area (schedule.c, cpupool.c) there is a
> >> rather complex handling involved when doing suspend and resume.
> >>
> >> This can be simplified a lot by not performing a complete cpu down and
> >> up cycle for the non-boot cpus, but keeping the pure software related
> >> state and freeing it only in case a cpu didn't come up again during
> >> resume.
> >>
> >> In summary not only the complexity can be reduced, but the failure
> >> tolerance will be even better with this series: With a dedicated hook
> >> for failing cpus when resuming it is now possible to survive e.g. a
> >> cpupool being left without any cpu after resume by moving its domains
> >> to cpupool0.
> >>
> >> Juergen Gross (6):
> >>   xen/sched: call cpu_disable_scheduler() via cpu notifier
> >>   xen: add helper for calling notifier_call_chain() to common/cpu.c
> >>   xen: add new cpu notifier action CPU_RESUME_FAILED
> >>   xen: don't free percpu areas during suspend
> >>   xen/cpupool: simplify suspend/resume handling
> >>   xen/sched: don't disable scheduler on cpus during suspend
> >>
> >>  xen/arch/arm/smpboot.c     |   4 -
> >>  xen/arch/x86/percpu.c      |   3 +-
> >>  xen/arch/x86/smpboot.c     |   3 -
> >>  xen/common/cpu.c           |  61 +++++++-------
> >>  xen/common/cpupool.c       | 131 ++++++++++++-----------------
> >>  xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
> >>  xen/include/xen/cpu.h      |  29 ++++---
> >>  xen/include/xen/sched-if.h |   1 -
> >>  8 files changed, 190 insertions(+), 245 deletions(-)
> >>
> >
> > I tested your patch series on ARM64 platform. We had issue with hard
> > affinity - there was assertion failure in sched_credit2 code during
> > suspension if one of the vCPUs is pinned to non-0 pCPU.
> >
> > Seems, your patch series fixes the issue during suspend. But now I'm
> > seeing crash during resume:
>
> Oh, I wasn't aware that suspend/resume is possible on ARM, too.
It is not upstreamed yet, but yes, it is possible.

> I need to modify arch/arm/percpu.c like on x86.
> Could you test whether the attached patch works?
Yes, it fixes the issue. Thank you.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:33   ` Julien Grall
@ 2019-03-28 13:37     ` Juergen Gross
  2019-03-28 13:49       ` Julien Grall
  2019-03-28 13:56     ` Volodymyr Babchuk
  1 sibling, 1 reply; 23+ messages in thread
From: Juergen Gross @ 2019-03-28 13:37 UTC (permalink / raw)
  To: Julien Grall, Volodymyr Babchuk
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Jan Beulich, xen-devel, Ian Jackson, Roger Pau Monné

On 28/03/2019 14:33, Julien Grall wrote:
> Hi,
> 
> On 3/28/19 1:01 PM, Volodymyr Babchuk wrote:
>> Hello Juergen,
>>
>> On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
>>>
>>> Especially in the scheduler area (schedule.c, cpupool.c) there is a
>>> rather complex handling involved when doing suspend and resume.
>>>
>>> This can be simplified a lot by not performing a complete cpu down and
>>> up cycle for the non-boot cpus, but keeping the pure software related
>>> state and freeing it only in case a cpu didn't come up again during
>>> resume.
>>>
>>> In summary not only the complexity can be reduced, but the failure
>>> tolerance will be even better with this series: With a dedicated hook
>>> for failing cpus when resuming it is now possible to survive e.g. a
>>> cpupool being left without any cpu after resume by moving its domains
>>> to cpupool0.
>>>
>>> Juergen Gross (6):
>>>    xen/sched: call cpu_disable_scheduler() via cpu notifier
>>>    xen: add helper for calling notifier_call_chain() to common/cpu.c
>>>    xen: add new cpu notifier action CPU_RESUME_FAILED
>>>    xen: don't free percpu areas during suspend
>>>    xen/cpupool: simplify suspend/resume handling
>>>    xen/sched: don't disable scheduler on cpus during suspend
>>>
>>>   xen/arch/arm/smpboot.c     |   4 -
>>>   xen/arch/x86/percpu.c      |   3 +-
>>>   xen/arch/x86/smpboot.c     |   3 -
>>>   xen/common/cpu.c           |  61 +++++++-------
>>>   xen/common/cpupool.c       | 131 ++++++++++++-----------------
>>>   xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
>>>   xen/include/xen/cpu.h      |  29 ++++---
>>>   xen/include/xen/sched-if.h |   1 -
>>>   8 files changed, 190 insertions(+), 245 deletions(-)
>>>
>>
>> I tested your patch series on ARM64 platform. We had issue with hard
>> affinity - there was assertion failure in sched_credit2 code during
>> suspension if one of the vCPUs is pinned to non-0 pCPU.
> When you report an error, please make clear what commit you are using
> and whether you have patches applied on top.
> 
> In this case, we have no support of suspend/resume on Arm today, so
> bug reports around suspend/resume are a bit confusing to have. It is
> also more difficult to help when you don't have the full picture, as a
> bug may be in your code or in upstream Xen.
> 
> I saw Juergen suggested a fix, please carry it in whatever series you have.
> 
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) PSCI cpu off failed for CPU0 err=-3
>> (XEN) ****************************************
> 
> PSCI CPU off failing is never good news. Here, the command has been
> denied by the PSCI monitor. But... why is CPU off actually called on
> CPU0? Shouldn't we have turned off the platform instead?

Could it be that a scheduler lock is no longer reachable as the percpu
memory of another cpu has been released and allocated again? That would
be one of the possible results of my series.


Juergen


* Re: [PATCH v2 4/6] xen: don't free percpu areas during suspend
  2019-03-28 12:06 ` [PATCH v2 4/6] xen: don't free percpu areas during suspend Juergen Gross
@ 2019-03-28 13:39   ` Jan Beulich
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2019-03-28 13:39 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andrew Cooper, Wei Liu, xen-devel, Roger Pau Monne

>>> On 28.03.19 at 13:06, <jgross@suse.com> wrote:
> Instead of freeing percpu areas during suspend and allocating them
> again when resuming keep them. Only free an area in case a cpu didn't
> come up again when resuming.
> 
> It should be noted that there is a potential change in behaviour as
> the percpu areas are no longer zeroed out during suspend/resume. While
> I have checked that the called cpu notifier hooks cope with that, there
> might be some well-hidden dependency on the previous behaviour. OTOH
> a component not registering itself for cpu down/up and expecting to
> see a zeroed percpu variable after suspend/resume is kind of broken
> already.

Back at the time it was intentional to have this behavior, and code
was in fact written to rely on it. What I can't immediately tell is
whether all such code has gone away.

> And the opposite case, where a component is not registered
> to be called for cpu down/up and is not expecting a percpu variable
> suddenly to be zero due to suspend/resume is much more probable,
> especially as the suspend/resume functionality seems not to be tested
> that often.

Right, but CPU offline/online is pretty easy to test (and
independent of firmware support), and it would continue to
have the observed effect. I agree though that for suspend/
resume, retaining the data is more likely to be what is
wanted. But don't forget that it was always implied that APs
would go fully down and then come back up during a suspend/
resume cycle.

Jan




* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:37     ` Juergen Gross
@ 2019-03-28 13:49       ` Julien Grall
  0 siblings, 0 replies; 23+ messages in thread
From: Julien Grall @ 2019-03-28 13:49 UTC (permalink / raw)
  To: Juergen Gross, Volodymyr Babchuk
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Dario Faggioli,
	Jan Beulich, xen-devel, nd, Ian Jackson, Roger Pau Monné



On 28/03/2019 13:37, Juergen Gross wrote:
> On 28/03/2019 14:33, Julien Grall wrote:
>> Hi,
>>
>> On 3/28/19 1:01 PM, Volodymyr Babchuk wrote:
>>> Hello Juergen,
>>>
>>> On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
>>>>
>>>> Especially in the scheduler area (schedule.c, cpupool.c) there is a
>>>> rather complex handling involved when doing suspend and resume.
>>>>
>>>> This can be simplified a lot by not performing a complete cpu down and
>>>> up cycle for the non-boot cpus, but keeping the pure software related
>>>> state and freeing it only in case a cpu didn't come up again during
>>>> resume.
>>>>
>>>> In summary not only the complexity can be reduced, but the failure
>>>> tolerance will be even better with this series: With a dedicated hook
>>>> for failing cpus when resuming it is now possible to survive e.g. a
>>>> cpupool being left without any cpu after resume by moving its domains
>>>> to cpupool0.
>>>>
>>>> Juergen Gross (6):
>>>>     xen/sched: call cpu_disable_scheduler() via cpu notifier
>>>>     xen: add helper for calling notifier_call_chain() to common/cpu.c
>>>>     xen: add new cpu notifier action CPU_RESUME_FAILED
>>>>     xen: don't free percpu areas during suspend
>>>>     xen/cpupool: simplify suspend/resume handling
>>>>     xen/sched: don't disable scheduler on cpus during suspend
>>>>
>>>>    xen/arch/arm/smpboot.c     |   4 -
>>>>    xen/arch/x86/percpu.c      |   3 +-
>>>>    xen/arch/x86/smpboot.c     |   3 -
>>>>    xen/common/cpu.c           |  61 +++++++-------
>>>>    xen/common/cpupool.c       | 131 ++++++++++++-----------------
>>>>    xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
>>>>    xen/include/xen/cpu.h      |  29 ++++---
>>>>    xen/include/xen/sched-if.h |   1 -
>>>>    8 files changed, 190 insertions(+), 245 deletions(-)
>>>>
>>>
>>> I tested your patch series on ARM64 platform. We had issue with hard
>>> affinity - there was assertion failure in sched_credit2 code during
>>> suspension if one of the vCPUs is pinned to non-0 pCPU.
>> When you report an error, please make clear what commit you are using
>> and whether you have patches applied on top.
>>
>> In this case, we have no support of suspend/resume on Arm today, so
>> bug reports around suspend/resume are a bit confusing to have. It is
>> also more difficult to help when you don't have the full picture, as a
>> bug may be in your code or in upstream Xen.
>>
>> I saw Juergen suggested a fix, please carry it in whatever series you have.
>>
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) PSCI cpu off failed for CPU0 err=-3
>>> (XEN) ****************************************
>>
>> PSCI CPU off failing is never good news. Here, the command has been
>> denied by the PSCI monitor. But... why is CPU off actually called on
>> CPU0? Shouldn't we have turned off the platform instead?
> 
> Could it be that a scheduler lock is no longer reachable as the percpu
> memory of another cpu has been released and allocated again? That would
> be one of the possible results of my series.

The data abort shown before the panic is potentially the percpu issue,
but I don't think it would have the effect of trying to turn off CPU0.
This looks more like an issue in the machine_halt/machine_restart path.

Indeed, CPU off may rightfully return -3 (DENIED) if the Trusted OS
resides on this CPU. We technically should have checked beforehand that
the CPU could be turned off, but it looks like we are missing this code.
I vaguely remember having already pointed out that issue in the past.

Cheers,

> 
> 
> Juergen
> 

-- 
Julien Grall

* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:33   ` Julien Grall
  2019-03-28 13:37     ` Juergen Gross
@ 2019-03-28 13:56     ` Volodymyr Babchuk
  2019-03-28 14:43       ` Julien Grall
  2019-03-28 14:53       ` Dario Faggioli
  1 sibling, 2 replies; 23+ messages in thread
From: Volodymyr Babchuk @ 2019-03-28 13:56 UTC (permalink / raw)
  To: Julien Grall
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Dario Faggioli, Jan Beulich, xen-devel, Ian Jackson,
	Roger Pau Monné

Hello Julien,

On Thu, 28 Mar 2019 at 15:33, Julien Grall <julien.grall@arm.com> wrote:
>
> Hi,
>
> On 3/28/19 1:01 PM, Volodymyr Babchuk wrote:
> > Hello Juergen,
> >
> > On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
> >>
> >> Especially in the scheduler area (schedule.c, cpupool.c) there is a
> >> rather complex handling involved when doing suspend and resume.
> >>
> >> This can be simplified a lot by not performing a complete cpu down and
> >> up cycle for the non-boot cpus, but keeping the pure software related
> >> state and freeing it only in case a cpu didn't come up again during
> >> resume.
> >>
> >> In summary not only the complexity can be reduced, but the failure
> >> tolerance will be even better with this series: With a dedicated hook
> >> for failing cpus when resuming it is now possible to survive e.g. a
> >> cpupool being left without any cpu after resume by moving its domains
> >> to cpupool0.
> >>
> >> Juergen Gross (6):
> >>    xen/sched: call cpu_disable_scheduler() via cpu notifier
> >>    xen: add helper for calling notifier_call_chain() to common/cpu.c
> >>    xen: add new cpu notifier action CPU_RESUME_FAILED
> >>    xen: don't free percpu areas during suspend
> >>    xen/cpupool: simplify suspend/resume handling
> >>    xen/sched: don't disable scheduler on cpus during suspend
> >>
> >>   xen/arch/arm/smpboot.c     |   4 -
> >>   xen/arch/x86/percpu.c      |   3 +-
> >>   xen/arch/x86/smpboot.c     |   3 -
> >>   xen/common/cpu.c           |  61 +++++++-------
> >>   xen/common/cpupool.c       | 131 ++++++++++++-----------------
> >>   xen/common/schedule.c      | 203 +++++++++++++++++++--------------------------
> >>   xen/include/xen/cpu.h      |  29 ++++---
> >>   xen/include/xen/sched-if.h |   1 -
> >>   8 files changed, 190 insertions(+), 245 deletions(-)
> >>
> >
> > I tested your patch series on ARM64 platform. We had issue with hard
> > affinity - there was assertion failure in sched_credit2 code during
> > suspension if one of the vCPUs is pinned to non-0 pCPU.
> When you report an error, please make clear what commit you are using
> and whether you have patches applied on top.

Sure.

> In this case, we have no support of suspend/resume on Arm today, so
> bug reports around suspend/resume are a bit confusing to have. It is
> also more difficult to help when you don't have the full picture, as a
> bug may be in your code or in upstream Xen.

Agreed. But in this case, the changes were mostly to common code.
I assumed it would be good to check and report it for Arm, even
if Arm suspend/resume is not upstreamed yet. Besides, this patch
series fixed another issue in the common suspend/resume code.

> I saw Juergen suggested a fix, please carry it in whatever series you have.
Yes, this patch fixes the issue.

By the way, we are using the patch series by Mirela that you mentioned earlier.

> > (XEN) ****************************************
> > (XEN) Panic on CPU 0:
> > (XEN) PSCI cpu off failed for CPU0 err=-3
> > (XEN) ****************************************
>
> PSCI CPU off failing is never good news. Here, the command has been
> denied by the PSCI monitor. But... why is CPU off actually called on
> CPU0? Shouldn't we have turned off the platform instead?
I think this is because CPU1 is performing machine_restart(), so it
asked CPU0 to halt itself.

>
> > (XEN)
> > (XEN) Reboot in five seconds...
>
> Are the logs below actually a mistaken paste?
No, this is what I'm seeing in my serial console.

> > (XEN) CPU2 will call ARM_SMCCC_ARCH_WORKAROUND_1 on exception entry
> > (XEN) CPU 2 booted.
> > (XEN) Data Abort Trap. Syndrome=0x6
> > (XEN) Walking Hypervisor VA 0x0 on CPU2 via TTBR 0x00000000781a8000
> > (XEN) 0TH[0x0] = 0x00000000781b0f7f
> > (XEN) 1ST[0x0] = 0x00000000781aaf7f
> > (XEN) 2ND[0x0] = 0x0000000000000000
> > (XEN) CPU2: Unexpected Trap: Data Abort
> > (XEN) ----[ Xen-4.13-unstable  arm64  debug=y   Not tainted ]----
> > (XEN) CPU:    2
> > (XEN) PC:     0000000000233660 _spin_lock+0x1c/0x88
> > (XEN) LR:     000000000023365c
> > (XEN) SP:     000080037ff77d50
> > (XEN) CPSR:   a00002c9 MODE:64-bit EL2h (Hypervisor, handler)
> > (XEN)      X0: 0000000000000006  X1: 00000000fffffffe  X2: 0000000000000000
> > (XEN)      X3: 0000000000000002  X4: 000080037fc42480  X5: 0000000000000000
> > (XEN)      X6: 0000000000000080  X7: 000080037ffb0000  X8: 00000000002a1000
> > (XEN)      X9: 000000000000000a X10: 000080037ff77bf8 X11: 0000000000000032
> > (XEN)     X12: 0000000000000001 X13: 000000000027fff0 X14: 0000000000000020
> > (XEN)     X15: 0000000000000000 X16: 0000000000000000 X17: 0000000000000000
> > (XEN)     X18: 0000000000000000 X19: 0000000000000000 X20: 0000000000000000
> > (XEN)     X21: 000080037ff7e108 X22: 0000000000000002 X23: 000000000033bc88
> > (XEN)     X24: 0000000000336020 X25: 0000000000000000 X26: 0000000000000002
> > (XEN)     X27: 0000000000336000 X28: 0000000000000000  FP: 000080037ff77d50
> > (XEN)
> > (XEN)   VTCR_EL2: 80023558
> > (XEN)  VTTBR_EL2: 0000000000000000
> > (XEN)
> > (XEN)  SCTLR_EL2: 30cd183d
> > (XEN)    HCR_EL2: 0000000000000038
> > (XEN)  TTBR0_EL2: 00000000781a8000
> > (XEN)
> > (XEN)    ESR_EL2: 96000006
> > (XEN)  HPFAR_EL2: 0000000000000000
> > (XEN)    FAR_EL2: 0000000000000000
> > (XEN)
> > (XEN) Xen stack trace from sp=000080037ff77d50:
> > (XEN)    000080037ff77d70 00000000002336e8 000080037ff7d000 000000000023e00c
> > (XEN)    000080037ff77d80 000000000022e90c 000080037ff77e10 0000000000232af8
> > (XEN)    0000000000000002 00000000002fbb00 ffffffffffffffff 000000000033cf20
> > (XEN)    00000000002a0680 0000000000000001 0000000000000001 0000000000000001
> > (XEN)    0000000000000000 000080037ff77e90 000080037ff77e50 00000000ffffffc8
> > (XEN)    000000000029f008 00000000002ffc41 000080037ff77e90 0000000000263c68
> > (XEN)    000080037ff77e50 0000000000232b6c 0000000000000002 0000000000000004
> > (XEN)    0000000000000002 00000000002fbc00 0000000000336448 00000000002fbb00
> > (XEN)    000080037ff77e60 0000000000257230 000080037ff77e90 0000000000263c6c
> > (XEN)    0000000000000002 0000000077e80000 0000000000000000 0000000000000001
> > (XEN)    0000000000000000 0000000000000002 0000000000000001 effffffffffaffff
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000
> > (XEN) Xen call trace:
> > (XEN)    [<0000000000233660>] _spin_lock+0x1c/0x88 (PC)
> > (XEN)    [<000000000023365c>] _spin_lock+0x18/0x88 (LR)
> > (XEN)    [<00000000002336e8>] _spin_lock_irq+0x1c/0x24
> > (XEN)    [<000000000022e90c>] schedule.c#schedule+0xe8/0x74c
> > (XEN)    [<0000000000232af8>] softirq.c#__do_softirq+0xcc/0xe4
> > (XEN)    [<0000000000232b6c>] do_softirq+0x14/0x1c
> > (XEN)    [<0000000000257230>] idle_loop+0x174/0x188
> > (XEN)    [<0000000000263c6c>] start_secondary+0x1f4/0x200
> > (XEN)    [<0000000000000002>] 0000000000000002
> > (XEN)
> > (XEN)
> > (XEN) ****************************************
> > (XEN) Panic on CPU 2:
> > (XEN) CPU2: Unexpected Trap: Data Abort
> > (XEN) ****************************************
> > (XEN)
> > (XEN) Reboot in five seconds...
>
> Cheers,
>
> --
> Julien Grall



-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:56     ` Volodymyr Babchuk
@ 2019-03-28 14:43       ` Julien Grall
  2019-03-28 14:53       ` Dario Faggioli
  1 sibling, 0 replies; 23+ messages in thread
From: Julien Grall @ 2019-03-28 14:43 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Dario Faggioli, Jan Beulich, xen-devel, Ian Jackson,
	Roger Pau Monné

Hi,

On 3/28/19 1:56 PM, Volodymyr Babchuk wrote:
> On Thu, 28 Mar 2019 at 15:33, Julien Grall <julien.grall@arm.com> wrote:
>> On 3/28/19 1:01 PM, Volodymyr Babchuk wrote:
>>> On Thu, 28 Mar 2019 at 14:09, Juergen Gross <jgross@suse.com> wrote:
> 
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) PSCI cpu off failed for CPU0 err=-3
>>> (XEN) ****************************************
>>
>> PSCI CPU off failing is never good news. Here, the command has been
>> denied by the PSCI monitor. But... why is CPU off actually called on
>> CPU0? Shouldn't we have turned off the platform instead?
> I think, this is because CPU1 is performing machine_restart(), so it
> asked CPU0 to halt itself.

The PSCI call SYSTEM_OFF/SYSTEM_RESET requires all the CPUs to be in a
known state. The PSCI spec actually suggests turning off all the CPUs
but one. An alternative is to put them in a quiescent state.

CPU_OFF will return -3 (i.e. DENIED) if the Trusted OS is uniprocessor
and resides on the core that you are about to turn off. I assume your
platform has a UP Trusted OS that currently resides on CPU0.

The SYSTEM_OFF/SYSTEM_RESET call can be done from any CPU. If we want
to avoid the problem with CPU_OFF, then the best option is to put all
the CPUs but one in a quiescent state (something like
while (1) cpu_relax()/wfi).

On a side note, we probably want to remove the panic in
call_psci_cpu_off, as this will be used by the suspend code. Indeed, the
Trusted OS may not reside on the boot CPU, so you may hit the panic
there as well.

> 
>>
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>
>> Are the logs below actually a mistaken paste?
> No, this is what I'm seeing in my serial console.

I guess this is happening because we recurse into machine_halt(). We
probably want to drop any panic in that code path.

Cheers,

-- 
Julien Grall


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 13:56     ` Volodymyr Babchuk
  2019-03-28 14:43       ` Julien Grall
@ 2019-03-28 14:53       ` Dario Faggioli
  2019-03-28 14:57         ` Volodymyr Babchuk
  1 sibling, 1 reply; 23+ messages in thread
From: Dario Faggioli @ 2019-03-28 14:53 UTC (permalink / raw)
  To: Volodymyr Babchuk, Julien Grall
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Jan Beulich, xen-devel, Ian Jackson, Roger Pau Monné



On Thu, 2019-03-28 at 15:56 +0200, Volodymyr Babchuk wrote:
> On Thu, 28 Mar 2019 at 15:33, Julien Grall <julien.grall@arm.com>
> wrote:
> > 
> > Are the logs below actually a mistaken paste?
> No, this is what I'm seeing in my serial console.
> 
Err, sorry, I'm a bit confused now...

So, if you use the ARM suspend/resume patches + Juergen's series +
Juergen's fixup patch from this subthread, do you still have issues, or
is everything all right?

If you still have issues, which splat are you seeing?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* Re: [PATCH v2 0/6] xen: simplify suspend/resume handling
  2019-03-28 14:53       ` Dario Faggioli
@ 2019-03-28 14:57         ` Volodymyr Babchuk
  0 siblings, 0 replies; 23+ messages in thread
From: Volodymyr Babchuk @ 2019-03-28 14:57 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, xen-devel, Ian Jackson,
	Roger Pau Monné

Hello Dario,

On Thu, 28 Mar 2019 at 16:54, Dario Faggioli <dfaggioli@suse.com> wrote:
>
> On Thu, 2019-03-28 at 15:56 +0200, Volodymyr Babchuk wrote:
> > On Thu, 28 Mar 2019 at 15:33, Julien Grall <julien.grall@arm.com>
> > wrote:
> > >
> > > Are the logs below actually a mistaken paste?
> > No, this is what I'm seeing in my serial console.
> >
> Err, sorry, I'm a bit confused now...
>
> So, if you use the ARM suspend/resume patches + Juergen series +
> Juergen fixup patch in this subthread, do you still have issues or is
> everything allright?
No, Juergen's fixup fixed the issue. So, everything is perfectly fine.

That was a discussion of an unrelated issue in the ARM platform code.
I think we need to drop the x86 folks from CC. Sorry for the noise.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier
  2019-03-28 12:06 ` [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier Juergen Gross
@ 2019-03-29 17:19   ` Dario Faggioli
  2019-04-01 10:34   ` Julien Grall
  1 sibling, 0 replies; 23+ messages in thread
From: Dario Faggioli @ 2019-03-29 17:19 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Julien Grall, Jan Beulich, Roger Pau Monné



On Thu, 2019-03-28 at 13:06 +0100, Juergen Gross wrote:
> cpu_disable_scheduler() is being called from __cpu_disable() today.
> There is no need to execute it on the cpu just being disabled, so use
> the CPU_DEAD case of the cpu notifier chain. Moving the call out of
> stop_machine() context is fine, as we just need to hold the domain
> RCU
> lock and need the scheduler percpu data to be still allocated.
> 
> Add another hook for CPU_DOWN_PREPARE to bail out early in case
> cpu_disable_scheduler() would fail. This will avoid crashes in rare
> cases for cpu hotplug or suspend.
> 
> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
> incarnation.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V2:
> - add CPU_DOWN_PREPARE hook
> - BUG() in case of cpu_disable_scheduler() failing in CPU_DEAD
>   (Jan Beulich)
> - modify ARM __cpu_disable(), too (Andrew Cooper)
> ---
>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

Regards,
Dario
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [PATCH v2 2/6] xen: add helper for calling notifier_call_chain() to common/cpu.c
  2019-03-28 12:06 ` [PATCH v2 2/6] xen: add helper for calling notifier_call_chain() to common/cpu.c Juergen Gross
@ 2019-03-29 17:33   ` Dario Faggioli
  0 siblings, 0 replies; 23+ messages in thread
From: Dario Faggioli @ 2019-03-29 17:33 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 898 bytes --]

On Thu, 2019-03-28 at 13:06 +0100, Juergen Gross wrote:
> Add a helper cpu_notifier_call_chain() to call notifier_call_chain()
> for a cpu with a specified action, returning an errno value.
> 
> This avoids coding the same pattern multiple times.
> 
> While at it avoid side effects from using BUG_ON() by not using
> cpu_online(cpu) as a parameter.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V2:
> - add nofail parameter to cpu_notifier_call_chain()
> - avoid side effects from using BUG_ON() macro (Andrew Cooper)
> ---
>
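[Editor's note: a hypothetical sketch of the helper pattern the quoted patch describes — one wrapper that turns a notifier chain result into an errno value, with a nofail flag for call sites that cannot tolerate failure. It also shows the BUG_ON() point: evaluate the side-effecting call first, rather than burying it inside an assert-like macro. Names and the fake chain are simplified stand-ins, not Xen's API.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define NOTIFY_DONE 0
#define NOTIFY_BAD  (-EBUSY)  /* simplified encoding of a failed notifier */

static int chain_result;      /* what the fake chain will return */

static int notifier_call_chain(unsigned long action, unsigned int cpu)
{
    (void)action; (void)cpu;
    return chain_result;
}

static int cpu_notifier_call_chain(unsigned int cpu, unsigned long action,
                                   bool nofail)
{
    /* Call once, into a plain variable: BUG_ON(fn(...)) would hide a
     * side-effecting call inside a macro, which the patch avoids. */
    int ret = notifier_call_chain(action, cpu);

    if ( nofail && ret != NOTIFY_DONE )
        abort();              /* stand-in for BUG() */

    return ret == NOTIFY_DONE ? 0 : ret;
}
```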
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

Regards,
Dario
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [PATCH v2 6/6] xen/sched: don't disable scheduler on cpus during suspend
  2019-03-28 12:06 ` [PATCH v2 6/6] xen/sched: don't disable scheduler on cpus during suspend Juergen Gross
@ 2019-03-29 17:36   ` Dario Faggioli
  0 siblings, 0 replies; 23+ messages in thread
From: Dario Faggioli @ 2019-03-29 17:36 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: George Dunlap


[-- Attachment #1.1: Type: text/plain, Size: 632 bytes --]

On Thu, 2019-03-28 at 13:06 +0100, Juergen Gross wrote:
> [...]
>
> Avoid all other scheduler dealloc/alloc dancing when doing suspend
> and resume, too. It is enough to react on cpus failing to come up on
> resume again.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
>
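[Editor's note: a hypothetical sketch of the suspend/resume idea in this series — keep per-cpu scheduler state across suspend instead of freeing and reallocating it, and tear it down only for a cpu that fails to come back up (the CPU_RESUME_FAILED path added earlier in the series). All names here are illustrative, not the actual Xen symbols.]

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static bool sched_state_allocated[NR_CPUS];

static void suspend_cpus(void)
{
    /* The old scheme freed per-cpu scheduler state here; the new
     * scheme deliberately keeps it, so resume has nothing to rebuild. */
}

static void resume_cpu(unsigned int cpu, bool came_up)
{
    if ( !came_up )
        /* CPU_RESUME_FAILED path: only now is the state freed. */
        sched_state_allocated[cpu] = false;
    /* else: nothing to do, the state survived suspend. */
}
```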
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

Regards,
Dario
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier
  2019-03-28 12:06 ` [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier Juergen Gross
  2019-03-29 17:19   ` Dario Faggioli
@ 2019-04-01 10:34   ` Julien Grall
  1 sibling, 0 replies; 23+ messages in thread
From: Julien Grall @ 2019-04-01 10:34 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Roger Pau Monné

Hi Juergen,

On 3/28/19 12:06 PM, Juergen Gross wrote:
> cpu_disable_scheduler() is being called from __cpu_disable() today.
> There is no need to execute it on the cpu just being disabled, so use
> the CPU_DEAD case of the cpu notifier chain. Moving the call out of
> stop_machine() context is fine, as we just need to hold the domain RCU
> lock and need the scheduler percpu data to be still allocated.
> 
> Add another hook for CPU_DOWN_PREPARE to bail out early in case
> cpu_disable_scheduler() would fail. This will avoid crashes in rare
> cases for cpu hotplug or suspend.
> 
> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
> incarnation.

Can we avoid sending the same patches in different series just a day 
apart? It makes it difficult to know which one to review...

It would be easier if you said in the cover letter that you depend on 
series Foo and provided a branch with the two series applied.

Anyway, I have commented on <20190329150934.17694-2-jgross@suse.com>.


Cheers,

-- 
Julien Grall



end of thread, other threads:[~2019-04-01 10:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-28 12:06 [PATCH v2 0/6] xen: simplify suspend/resume handling Juergen Gross
2019-03-28 12:06 ` [PATCH v2 1/6] xen/sched: call cpu_disable_scheduler() via cpu notifier Juergen Gross
2019-03-29 17:19   ` Dario Faggioli
2019-04-01 10:34   ` Julien Grall
2019-03-28 12:06 ` [PATCH v2 2/6] xen: add helper for calling notifier_call_chain() to common/cpu.c Juergen Gross
2019-03-29 17:33   ` Dario Faggioli
2019-03-28 12:06 ` [PATCH v2 3/6] xen: add new cpu notifier action CPU_RESUME_FAILED Juergen Gross
2019-03-28 12:06 ` [PATCH v2 4/6] xen: don't free percpu areas during suspend Juergen Gross
2019-03-28 13:39   ` Jan Beulich
2019-03-28 12:06 ` [PATCH v2 5/6] xen/cpupool: simplify suspend/resume handling Juergen Gross
2019-03-28 12:06 ` [PATCH v2 6/6] xen/sched: don't disable scheduler on cpus during suspend Juergen Gross
2019-03-29 17:36   ` Dario Faggioli
2019-03-28 13:01 ` [PATCH v2 0/6] xen: simplify suspend/resume handling Volodymyr Babchuk
2019-03-28 13:21   ` Juergen Gross
2019-03-28 13:23     ` Julien Grall
2019-03-28 13:34     ` Volodymyr Babchuk
2019-03-28 13:33   ` Julien Grall
2019-03-28 13:37     ` Juergen Gross
2019-03-28 13:49       ` Julien Grall
2019-03-28 13:56     ` Volodymyr Babchuk
2019-03-28 14:43       ` Julien Grall
2019-03-28 14:53       ` Dario Faggioli
2019-03-28 14:57         ` Volodymyr Babchuk
