* [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
@ 2020-02-12 11:21 Sergey Dyasli
From: Sergey Dyasli @ 2020-02-12 11:21 UTC (permalink / raw)
  To: Xen-devel, Juergen Gross
  Cc: George Dunlap, sergey.dyasli@citrix.com >> Sergey Dyasli,
	Dario Faggioli

Hi Juergen,

Recently our testing has found a host crash which is reproducible.
Do you have any idea what might be going on here?

(XEN) [175654.165126] Assertion 'lock == get_sched_res(i->res->master_cpu)->schedule_lock' failed at ...ild/BUILD/xen-4.13.1/xen/include/xen/sched-if.h:269
(XEN) [175654.165133] ----[ Xen-4.13.1-9.0.3-d  x86_64  debug=y   Not tainted ]----
(XEN) [175654.165136] CPU:    28
(XEN) [175654.165138] RIP:    e008:[<ffff82d08023d2d2>] vcpu_runstate_get+0x11e/0x14f
(XEN) [175654.165146] RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v4)
(XEN) [175654.165151] rax: ffff83403ff0d340   rbx: ffff83807cc97ac8   rcx: 0000000000000006
(XEN) [175654.165154] rdx: 0000006fbf942000   rsi: ffff83400f8e1cd8   rdi: 00000000107898e2
(XEN) [175654.165158] rbp: ffff83807cc97ab8   rsp: ffff83807cc97a88   r8:  deadbeefdeadf00d
(XEN) [175654.165160] r9:  deadbeefdeadf00d   r10: 0000000000000000   r11: 0000000000000000
(XEN) [175654.165164] r12: ffff83400fa6f000   r13: ffff83400f8c9778   r14: ffff82d0805c8008
(XEN) [175654.165167] r15: ffff832e30854ae0   cr0: 0000000080050033   cr4: 0000000000362660
(XEN) [175654.165170] cr3: 0000002130811000   cr2: ffff88817f50b728
(XEN) [175654.165172] fsb: 00007f40a40da740   gsb: ffff88831d300000   gss: 0000000000000000
(XEN) [175654.165175] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) [175654.165179] Xen code around <ffff82d08023d2d2> (vcpu_runstate_get+0x11e/0x14f):
(XEN) [175654.165181]  04 10 4c 3b 68 10 74 02 <0f> 0b 4c 89 ef e8 7e 5d 00 00 48 8d 05 41 9d 38
(XEN) [175654.165192] Xen stack trace from rsp=ffff83807cc97a88:
(XEN) [175654.165194]    ffff83807cc97aa8 ffff83400fa75a60 0000000000000000 ffff83807cc97da0
(XEN) [175654.165199]    0000000000000230 ffff83807cc97fff ffff83807cc97af8 ffff82d08023d41f
(XEN) [175654.165204]    0000000000000001 00009fc1ac1cb2f4 00004840c423acdc 00005780e7f9735a
(XEN) [175654.165207]    0000000000000000 0000000000000000 ffff83807cc97c98 ffff82d0802ea9f7
(XEN) [175654.165211]    0000000000000000 00009fc1ac1c6b99 0000000500000007 ffff83807cc97c10
(XEN) [175654.165215]    ffff83807cc97bb0 0000000000000020 0000000000000000 0000000000000000
(XEN) [175654.165251]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165254]    0000000000000000 0000000000000000 0000000000000000 aaaaaaaaaaaaaaaa
(XEN) [175654.165258]    ffff82d0805c8038 ffff82d0805c74a0 aaaaaaaa00000000 aaaaaaaaaaaaaa00
(XEN) [175654.165263]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165266]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165269]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165273]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165276]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165279]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [175654.165283]    ffff83400f813000 ffff83807cc97d98 0000000000000000 ffff82d0805cda80
(XEN) [175654.165287]    0000000000000230 ffff83807cc97fff ffff83807cc97cc8 ffff82d08026d99b
(XEN) [175654.165291]    ffff83807cc97ef8 ffff83400f813000 ffff82d0805cda80 0000000000000230
(XEN) [175654.165295]    ffff83807cc97e48 ffff82d080244573 00007f40a40e6000 0000000000000206
(XEN) [175654.165300]    ffff82004006c000 0000000000000000 0000000000000000 ffff82e08a815e80
(XEN) [175654.165304] Xen call trace:
(XEN) [175654.165306]    [<ffff82d08023d2d2>] R vcpu_runstate_get+0x11e/0x14f
(XEN) [175654.165310]    [<ffff82d08023d41f>] F get_cpu_idle_time+0x4d/0x53
(XEN) [175654.165315]    [<ffff82d0802ea9f7>] F pmstat_get_cx_stat+0x82/0x8e7
(XEN) [175654.165319]    [<ffff82d08026d99b>] F do_get_pm_info+0x27b/0x2d4
(XEN) [175654.165322]    [<ffff82d080244573>] F do_sysctl+0x633/0x14e0
(XEN) [175654.165327]    [<ffff82d080382335>] F pv_hypercall+0x1f5/0x567
(XEN) [175654.165330]    [<ffff82d080389432>] F lstar_enter+0x112/0x120
(XEN) [175654.165332]
(XEN) [175654.550916]
(XEN) [175654.553243] ****************************************
(XEN) [175654.559449] Panic on CPU 28:
(XEN) [175654.563328] Assertion 'lock == get_sched_res(i->res->master_cpu)->schedule_lock' failed at ...ild/BUILD/xen-4.13.1/xen/include/xen/sched-if****************************************
(XEN) [175654.581847]
(XEN) [175654.584173] Reboot in five seconds...
(XEN) [175654.588925] Executing kexec image on cpu28
(XEN) [175654.594987] Shot down all CPUs


The state of the sibling was:


  PCPU 29 Host state:
	RIP:    e008:[<ffff82d080219fb0>] Ring 0
	RFLAGS: 0000000000040002  AC IOPL0

	rax: ffff83400f8c91e4   rbx: 000000000000001d   rcx: ffff83400f8c91f4
	rdx: ffff83400f8c9104   rsi: ffff83400f8c9094   rdi: 0000000000000004
	rbp: ffff83807cc89f28   rsp: ffff83807cc89f28   r8:  0000000000000000
	r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
	r12: 0000000000000000   r13: 0000000000000000   r14: ffff83807cc8ffff
	r15: 0000000000000000

	cr0: 0000000080050033   PG AM WP NE ET MP PE
	cr3: 000000406e5ff000   cr2: 0000000000900030
	cr4: 0000000000362660   SMEP OSXSAVE PCIDE VMXE OSXMMEXCPT OSFXSR MCE PAE

	ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008

	stack current VCPU  ffff83400f80f000 DOM0 VCPU5
	percpu current VCPU ffff83400f80f000 DOM0 VCPU5
	VCPU was RUNNING

	Stack at ffff83807cc89f28:
	  ffff83807cc89f20:                  ffff83807cc89f48 ffff82d0802758bb ffff82d080389d84
	  ffff83807cc89f40: 0000000000000000 00007c7f83376087 ffff82d080389e21 ffff83400f861060
	  ffff83807cc89f60: 000000000000001d ffff82d0805ec5a0 ffff83400f8f09ae ffff83807cc8fd78
	  ffff83807cc89f80: ffff83400f8f09a8 0000000000000000 0000000000000000 ffff83400f8e1c20
	  ffff83807cc89fa0: 0000000000000000 0000000000008326 0000000000000000 0000000000000001
	  ffff83807cc89fc0: ffff82d0805c8326 ffff83400f8f09ae 0000000200000000 ffff82d080242e50
	  ffff83807cc89fe0: 000000000000e008 0000000000000046 ffff83807cc8fd60 000000000000e010

	Code:
	   5b 41 5c 5d c3 66 2e 0f 1f 84 00 00 00 00 00 <55> 48 89 e5 4c 89 3f 4c 89 77 08 4c 89 6f 10 4c 89

	Call Trace:
	 [ffff82d080219fb0] elf_core_save_regs+0/0xae
	  ffff82d0802758bb  do_nmi_crash+0x8b/0xf4
	  ffff82d080389d84  handle_ist_exception+0xaa/0x1b6
	  ffff82d080389e21  handle_ist_exception+0x147/0x1b6

	      NMI interrupted Code at e008:ffff82d080242e50 and Stack at e010:ffff83807cc8fd60

	 [ffff82d080242e50] got_lock+0/0x23
	  ffff82d080242fcb  _spin_lock+0x41/0x5e
	  ffff82d080242ffb  _spin_lock_irq+0x13/0x15
	  ffff82d080240bc5  sched_wait_rendezvous_in+0x25a/0x2cc
	  ffff82d08024109b  schedule+0x1bc/0x2b4
	  ffff82d0803893d4  lstar_enter+0xb4/0x120
	  ffff82d080382335  pv_hypercall+0x1f5/0x567
	  ffff82d0803893d4  lstar_enter+0xb4/0x120
	  ffff82d0802425f5  __do_softirq+0x85/0x90
	  ffff82d08024264a  do_softirq+0x13/0x15
	  ffff82d080386c76  process_softirqs+0x6/0x20

--
Thanks,
Sergey

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
@ 2020-02-12 12:24 ` Jürgen Groß
From: Jürgen Groß @ 2020-02-12 12:24 UTC (permalink / raw)
  To: Sergey Dyasli, Xen-devel; +Cc: George Dunlap, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

On 12.02.20 12:21, Sergey Dyasli wrote:
> Hi Juergen,
> 
> Recently our testing has found a host crash which is reproducible.
> Do you have any idea what might be going on here?

Oh, nice catch!

The problem is that get_cpu_idle_time() calls vcpu_runstate_get() for an
idle vcpu. This is fragile, as idle vcpus are sometimes assigned
temporarily to normal scheduling units. The ASSERT() in the unlock
function fails when the assignment of the idle vcpu is modified under
the feet of vcpu_runstate_get() and the unit it was previously assigned
to is already scheduled on another cpu.
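
A minimal sketch of that failure mode follows. It is single-threaded and
uses hypothetical toy_* types and helpers, which are simplified stand-ins
and not Xen's real definitions. The reassign_to parameter simulates
another cpu changing the idle vcpu's unit between lock and unlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for Xen's scheduler types. */
struct toy_lock { bool held; };
struct toy_unit { struct toy_lock *schedule_lock; };
struct toy_vcpu { struct toy_unit *sched_unit; };

/* Take the lock currently associated with a unit and return it. */
struct toy_lock *toy_unit_lock(struct toy_unit *unit)
{
    assert(unit != NULL);
    struct toy_lock *lock = unit->schedule_lock;
    lock->held = true;
    return lock;
}

/*
 * Release the lock. Returns false in the situation where the real unlock
 * helper's ASSERT() would fire: the lock taken earlier no longer matches
 * the lock of the unit passed in.
 */
bool toy_unit_unlock(struct toy_lock *lock, struct toy_unit *unit)
{
    bool consistent = (lock == unit->schedule_lock);
    lock->held = false;
    return consistent;
}

/*
 * Mimics the pre-patch pattern: v->sched_unit is read once for locking
 * and read again for unlocking. A non-NULL reassign_to simulates the
 * assignment changing in between.
 */
bool toy_runstate_get_racy(struct toy_vcpu *v, struct toy_unit *reassign_to)
{
    struct toy_lock *lock = toy_unit_lock(v->sched_unit);

    if (reassign_to != NULL)
        v->sched_unit = reassign_to;   /* changed under our feet */

    return toy_unit_unlock(lock, v->sched_unit);
}
```

Calling toy_runstate_get_racy() with reassign_to == NULL returns true,
while passing a unit that has a different lock returns false, which
corresponds to the ASSERT() seen in the crash.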

The patch is rather easy, though. Can you try it, please?


Juergen

[-- Attachment #2: 0001-xen-sched-fix-get_cpu_idle_time-with-core-scheduling.patch --]
[-- Type: text/x-patch, Size: 2273 bytes --]

From 0236aee221409fa826a81395f2f3e8b15d5128de Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
To: xen-devel@lists.xenproject.org
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Dario Faggioli <dfaggioli@suse.com>
Date: Wed, 12 Feb 2020 13:04:16 +0100
Subject: [PATCH] xen/sched: fix get_cpu_idle_time() with core scheduling

get_cpu_idle_time() is calling vcpu_runstate_get() for an idle vcpu.
With core scheduling active this is fragile, as idle vcpus are assigned
to other scheduling units temporarily, and that assignment is changed
in some cases without holding the scheduling lock, and
vcpu_runstate_get() is using v->sched_unit as parameter for
unit_schedule_[un]lock_irq(), resulting in an ASSERT() triggering in
unlock in case v->sched_unit has changed meanwhile.

Fix that by using a local unit variable holding the correct unit.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/sched/core.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 2e43f8029f..de5a6b1a57 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -308,17 +308,26 @@ void vcpu_runstate_get(const struct vcpu *v,
 {
     spinlock_t *lock;
     s_time_t delta;
+    struct sched_unit *unit;
 
     rcu_read_lock(&sched_res_rculock);
 
-    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(v->sched_unit);
+    /*
+     * Be careful in case of an idle vcpu: the assignment to a unit might
+     * change even with the scheduling lock held, so be sure to use the
+     * correct unit for locking in order to avoid triggering an ASSERT() in
+     * the unlock function.
+     */
+    unit = is_idle_vcpu(v) ? get_sched_res(v->processor)->sched_unit_idle
+                           : v->sched_unit;
+    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(unit);
     memcpy(runstate, &v->runstate, sizeof(*runstate));
     delta = NOW() - runstate->state_entry_time;
     if ( delta > 0 )
         runstate->time[runstate->state] += delta;
 
     if ( unlikely(lock != NULL) )
-        unit_schedule_unlock_irq(lock, v->sched_unit);
+        unit_schedule_unlock_irq(lock, unit);
 
     rcu_read_unlock(&sched_res_rculock);
 }
-- 
2.16.4
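
The idea of the patch can be sketched outside of Xen as follows. This is
only an illustration under stated assumptions: the toy_* types and
helpers are hypothetical, and cpu_idle_unit stands in for
get_sched_res(v->processor)->sched_unit_idle. The point shown is that the
unit is chosen once, kept in a local variable, and used for both lock and
unlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for Xen's scheduler types. */
struct toy_lock { bool held; };
struct toy_unit { struct toy_lock *schedule_lock; };
struct toy_vcpu {
    bool is_idle;
    struct toy_unit *sched_unit;     /* may change for an idle vcpu */
    struct toy_unit *cpu_idle_unit;  /* stable idle unit of the vcpu's cpu */
};

/* Take the lock currently associated with a unit and return it. */
struct toy_lock *toy_unit_lock(struct toy_unit *unit)
{
    assert(unit != NULL);
    struct toy_lock *lock = unit->schedule_lock;
    lock->held = true;
    return lock;
}

/* Release the lock; false means the real ASSERT() would have fired. */
bool toy_unit_unlock(struct toy_lock *lock, struct toy_unit *unit)
{
    bool consistent = (lock == unit->schedule_lock);
    lock->held = false;
    return consistent;
}

/*
 * Mimics the patched pattern: pick the unit once and reuse the local
 * variable, so a later change of v->sched_unit cannot make lock and
 * unlock disagree. A non-NULL reassign_to simulates that change.
 */
bool toy_runstate_get_fixed(struct toy_vcpu *v, struct toy_unit *reassign_to)
{
    struct toy_unit *unit = v->is_idle ? v->cpu_idle_unit : v->sched_unit;
    struct toy_lock *lock = toy_unit_lock(unit);

    if (reassign_to != NULL)
        v->sched_unit = reassign_to;   /* no longer affects unlock */

    return toy_unit_unlock(lock, unit);
}
```

With an idle vcpu, toy_runstate_get_fixed() returns true even when
reassign_to points at a unit with a different lock, because unlock uses
the same unit that was used for lock.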



* Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
@ 2020-02-13 14:19   ` Sergey Dyasli
From: Sergey Dyasli @ 2020-02-13 14:19 UTC (permalink / raw)
  To: Jürgen Groß, Xen-devel
  Cc: George Dunlap, sergey.dyasli@citrix.com >> Sergey Dyasli,
	Dario Faggioli

On 12/02/2020 12:24, Jürgen Groß wrote:
> On 12.02.20 12:21, Sergey Dyasli wrote:
>> Hi Juergen,
>>
>> Recently our testing has found a host crash which is reproducible.
>> Do you have any idea what might be going on here?
>
> Oh, nice catch!
>
> The problem is that get_cpu_idle_time() calls vcpu_runstate_get() for an
> idle vcpu. This is fragile, as idle vcpus are sometimes assigned
> temporarily to normal scheduling units. The ASSERT() in the unlock
> function fails when the assignment of the idle vcpu is modified under
> the feet of vcpu_runstate_get() and the unit it was previously assigned
> to is already scheduled on another cpu.
>
> The patch is rather easy, though. Can you try it, please?

Thank you for the patch! I put it into testing yesterday and it looks
good so far. It also seems that the issue is well understood and the
patch should go into the main tree.

--
Sergey


* Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
@ 2020-02-13 14:20     ` Jürgen Groß
From: Jürgen Groß @ 2020-02-13 14:20 UTC (permalink / raw)
  To: Sergey Dyasli, Xen-devel; +Cc: George Dunlap, Dario Faggioli

On 13.02.20 15:19, Sergey Dyasli wrote:
> On 12/02/2020 12:24, Jürgen Groß wrote:
>> On 12.02.20 12:21, Sergey Dyasli wrote:
>>> Hi Juergen,
>>>
>>> Recently our testing has found a host crash which is reproducible.
>>> Do you have any idea what might be going on here?
>>
>> Oh, nice catch!
>>
>> The problem is that get_cpu_idle_time() calls vcpu_runstate_get() for an
>> idle vcpu. This is fragile, as idle vcpus are sometimes assigned
>> temporarily to normal scheduling units. The ASSERT() in the unlock
>> function fails when the assignment of the idle vcpu is modified under
>> the feet of vcpu_runstate_get() and the unit it was previously assigned
>> to is already scheduled on another cpu.
>>
>> The patch is rather easy, though. Can you try it, please?
> 
> Thank you for the patch! I put it into testing yesterday and it looks
> good so far. It also seems that the issue is well understood and the
> patch should go into the main tree.

Just wanted to make sure it really fixes your problem. :-)


Juergen
