* BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Igor Druzhinin @ 2021-06-14 11:15 UTC
  To: xen-devel, boris.ostrovsky, roger.pau, stephen.s.brennan

Hi, Boris, Stephen, Roger,

We have stress-tested recent changes on staging-4.13, which include a
backport of the commit named in the subject. Since the backport is
identical to what is on the master branch and all of the prerequisite
patches are in place, we have no reason to believe master is not
affected in the same way.

Here is what we got by running heavy stress testing, including many
repeated VM lifecycle operations under storage and network load:


Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
----[ Xen-4.13.3-10.7-d  x86_64  debug=y   Not tainted ]----
CPU:    17
RIP:    e008:[<ffff82d080246b65>] common/timer.c#active_timer+0xc/0x1b
RFLAGS: 0000000000010046   CONTEXT: hypervisor (d675v0)
rax: 0000000000000000   rbx: ffff83137a8ed300   rcx: 0000000000000000
rdx: ffff83303fff7fff   rsi: ffff83303fff2549   rdi: ffff83137a8ed300
rbp: ffff83303fff7cf8   rsp: ffff83303fff7cf8   r8:  0000000000000001
r9:  0000000000000000   r10: 0000000000000011   r11: 0000168b0cc08083
r12: 0000000000000000   r13: ffff82d0805cf300   r14: ffff82d0805cf300
r15: 0000000000000292   cr0: 0000000080050033   cr4: 00000000000426e0
cr3: 00000013c1a32000   cr2: 0000000000000000
fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
Xen code around <ffff82d080246b65> (common/timer.c#active_timer+0xc/0x1b):
  0f b6 47 2a 84 c0 75 02 <0f> 0b 3c 04 76 02 0f 0b 3c 02 0f 97 c0 5d c3 55
Xen stack trace from rsp=ffff83303fff7cf8:
    ffff83303fff7d48 ffff82d0802479f1 0000168b0192b846 ffff83137a8ed328
    000000001d0776eb ffff83137a8ed2c0 ffff83133ee47568 ffff83133ee47000
    ffff83133ee47560 ffff832b1a0cd000 ffff83303fff7d78 ffff82d08031e74e
    ffff83102d898000 ffff83133ee47000 ffff83102db8d000 0000000000000011
    ffff83303fff7dc8 ffff82d08027df19 0000000000000000 ffff83133ee47060
    ffff82d0805d0088 ffff83102d898000 ffff83133ee47000 0000000000000011
    0000000000000001 0000000000000011 ffff83303fff7e08 ffff82d0802414e0
    ffff83303fff7df8 0000168b0192b846 ffff83102d8a4660 0000168b0192b846
    ffff83102d8a4720 0000000000000011 ffff83303fff7ea8 ffff82d080241d6c
    ffff83133ee47000 ffff831244137a50 ffff83303fff7e48 ffff82d08031b5b8
    ffff83133ee47000 ffff832b1a0cd000 ffff83303fff7e68 ffff82d080312b65
    ffff83133ee47000 0000000000000000 ffff83303fff7ee8 ffff83102d8a4678
    ffff83303fff7ee8 ffff82d0805d6380 ffff82d0805d5b00 ffffffffffffffff
    ffff83303fff7fff 0000000000000000 ffff83303fff7ed8 ffff82d0802431f5
    ffff83133ee47000 0000000000000000 0000000000000000 0000000000000000
    ffff83303fff7ee8 ffff82d08024324a 00007ccfc00080e7 ffff82d08033930b
    ffffffffb0ebd5a0 000000000000000d 0000000000000062 0000000000000097
    000000000000001e 000000000000001f ffffffffb0ebd5ad 0000000000000000
    0000000000000005 000000000003d91d 0000000000000000 0000000000000000
    00000000000003d5 000000000000001e 0000000000000000 0000beef0000beef
Xen call trace:
    [<ffff82d080246b65>] R common/timer.c#active_timer+0xc/0x1b
    [<ffff82d0802479f1>] F stop_timer+0xf5/0x188
    [<ffff82d08031e74e>] F pt_save_timer+0x45/0x8a
    [<ffff82d08027df19>] F context_switch+0xf9/0xee0
    [<ffff82d0802414e0>] F common/schedule.c#sched_context_switch+0x146/0x151
    [<ffff82d080241d6c>] F common/schedule.c#schedule+0x28a/0x299
    [<ffff82d0802431f5>] F common/softirq.c#__do_softirq+0x85/0x90
    [<ffff82d08024324a>] F do_softirq+0x13/0x15
    [<ffff82d08033930b>] F vmx_asm_do_vmentry+0x2b/0x30

****************************************
Panic on CPU 17:
Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
****************************************

This looks like a race opened by this change - we did not see the problem
before when running with only the prerequisite patches applied.

Could you please analyse this assertion failure?

Igor
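
For context on the assertion itself: it lives in active_timer() in
common/timer.c, which sanity-checks a timer's status field before reporting
whether the timer is live. The sketch below is paraphrased from the 4.13-era
source rather than quoted, so treat the details as approximate; the point is
that a timer which has never been through init_timer() still has status
TIMER_STATUS_invalid (zero) and therefore trips the first check:

static bool active_timer(const struct timer *timer)
{
    /* A zeroed, never-initialized timer has status TIMER_STATUS_invalid
     * (0), which fails this first assertion - the one reported above. */
    ASSERT(timer->status >= TIMER_STATUS_inactive);
    ASSERT(timer->status <= TIMER_STATUS_in_list);

    return timer->status >= TIMER_STATUS_in_heap;
}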



* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Jan Beulich @ 2021-06-14 11:53 UTC
  To: Igor Druzhinin; +Cc: xen-devel, boris.ostrovsky, stephen.s.brennan, roger.pau

On 14.06.2021 13:15, Igor Druzhinin wrote:
> [...]
>
> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> [... CPU state and hex stack dump snipped ...]
> Xen call trace:
>     [<ffff82d080246b65>] R common/timer.c#active_timer+0xc/0x1b
>     [<ffff82d0802479f1>] F stop_timer+0xf5/0x188
>     [<ffff82d08031e74e>] F pt_save_timer+0x45/0x8a
>     [<ffff82d08027df19>] F context_switch+0xf9/0xee0
>     [<ffff82d0802414e0>] F common/schedule.c#sched_context_switch+0x146/0x151
>     [<ffff82d080241d6c>] F common/schedule.c#schedule+0x28a/0x299
>     [<ffff82d0802431f5>] F common/softirq.c#__do_softirq+0x85/0x90
>     [<ffff82d08024324a>] F do_softirq+0x13/0x15
>     [<ffff82d08033930b>] F vmx_asm_do_vmentry+0x2b/0x30
>
> ****************************************
> Panic on CPU 17:
> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> ****************************************

Since this suggests a timer was found on the list without ever having been
initialized, I've spotted a case where this indeed could now happen. Could
you give the patch below a try?

Jan

x86/vpt: fully init timers before putting onto list

With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
iterating the list and acting on the timers of the list entries will no
longer be kept from entering their loops by create_periodic_time()'s
holding of that lock. Therefore at least init_timer() needs calling
ahead of list insertion, but keep this and set_timer() together.

Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- unstable.orig/xen/arch/x86/hvm/vpt.c
+++ unstable/xen/arch/x86/hvm/vpt.c
@@ -554,14 +554,14 @@ void create_periodic_time(
     pt->cb = cb;
     pt->priv = data;
 
+    init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
+    set_timer(&pt->timer, pt->scheduled);
+
     pt_vcpu_lock(v);
     pt->on_list = 1;
     list_add(&pt->list, &v->arch.hvm.tm_list);
     pt_vcpu_unlock(v);
 
-    init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
-    set_timer(&pt->timer, pt->scheduled);
-
     write_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
 }
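
For reference, the reader side of the race is the pt_save_timer() path seen
in the call trace. The simplified sketch below is paraphrased from
xen/arch/x86/hvm/vpt.c (not an exact copy) and illustrates why, once
pt_vcpu_lock() stopped taking pt_migrate, this loop can observe a timer that
is already on tm_list but has not yet been through init_timer():

void pt_save_timer(struct vcpu *v)
{
    struct list_head *head = &v->arch.hvm.tm_list;
    struct periodic_time *pt;

    if ( v->pause_flags & VPF_blocked )
        return;

    /* Only the per-vCPU tm_lock is taken here now, so this can run
     * concurrently with create_periodic_time() on another CPU. */
    pt_vcpu_lock(v);

    list_for_each_entry ( pt, head, list )
        if ( !pt->do_not_freeze )
            stop_timer(&pt->timer); /* asserts if pt->timer is uninitialized */

    pt_vcpu_unlock(v);
}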
 




* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Roger Pau Monné @ 2021-06-14 13:27 UTC
  To: Jan Beulich; +Cc: Igor Druzhinin, xen-devel, boris.ostrovsky, stephen.s.brennan

On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
> On 14.06.2021 13:15, Igor Druzhinin wrote:
> > [...]
> > Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> > [... CPU state, stack and call trace snipped ...]
> 
> Since this suggests a timer was found on the list without ever having been
> initialized, I've spotted a case where this indeed could now happen. Could
> you give the patch below a try?
> 
> Jan
> 
> x86/vpt: fully init timers before putting onto list
> 
> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
> iterating the list and acting on the timers of the list entries will no
> longer be kept from entering their loops by create_periodic_time()'s
> holding of that lock. Therefore at least init_timer() needs calling
> ahead of list insertion, but keep this and set_timer() together.
> 
> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Thanks for looking into this so quickly, and sorry for not realizing
this myself when relaxing the locking. Adding the timer to the list
without it being fully initialized was a latent issue, even if it was
initially protected by the lock.

Provided testing shows the issue is fixed:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Roger.
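
For readers without the original commit to hand: the relaxation referred to
here changed pt_vcpu_lock()/pt_vcpu_unlock() from taking both the domain-wide
pt_migrate read lock and the per-vCPU tm_lock to taking only the latter. The
before/after sketch below is an approximation of that change, not the literal
patch:

/* Before "x86/vpt: do not take pt_migrate rwlock in some cases": list
 * readers also held pt_migrate for reading, so the write_lock taken by
 * create_periodic_time() kept them out of their loops entirely. */
static void pt_vcpu_lock(struct vcpu *v)
{
    read_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
    spin_lock(&v->arch.hvm.tm_lock);
}

/* After: only the per-vCPU lock is taken, so a reader can walk
 * v->arch.hvm.tm_list while create_periodic_time() is still setting up a
 * new entry - hence the ordering fix above. */
static void pt_vcpu_lock(struct vcpu *v)
{
    spin_lock(&v->arch.hvm.tm_lock);
}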



* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Jan Beulich @ 2021-06-14 16:01 UTC
  To: Roger Pau Monné
  Cc: Igor Druzhinin, xen-devel, boris.ostrovsky, stephen.s.brennan

On 14.06.2021 15:27, Roger Pau Monné wrote:
> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>> On 14.06.2021 13:15, Igor Druzhinin wrote:
>>> [...]
>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>> [... CPU state, stack and call trace snipped ...]
>>
>> Since this suggests a timer was found on the list without ever having been
>> initialized, I've spotted a case where this indeed could now happen. Could
>> you give the patch below a try?
>>
>> Jan
>>
>> x86/vpt: fully init timers before putting onto list
>>
>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>> iterating the list and acting on the timers of the list entries will no
>> longer be kept from entering their loops by create_periodic_time()'s
>> holding of that lock. Therefore at least init_timer() needs calling
>> ahead of list insertion, but keep this and set_timer() together.
>>
>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> > Thanks for looking into this so quickly, and sorry for not realizing
> > this myself when relaxing the locking. Adding the timer to the list
> > without it being fully initialized was a latent issue, even if it was
> > initially protected by the lock.
> 
> Provided testing shows the issue is fixed:

I guess the change here is needed anyway, even if testing finds there's
still something amiss?

> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

Jan




* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Andrew Cooper @ 2021-06-14 16:10 UTC
  To: Jan Beulich, Roger Pau Monné
  Cc: Igor Druzhinin, xen-devel, boris.ostrovsky, stephen.s.brennan

On 14/06/2021 17:01, Jan Beulich wrote:
> On 14.06.2021 15:27, Roger Pau Monné wrote:
>> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>>> On 14.06.2021 13:15, Igor Druzhinin wrote:
>>>> [...]
>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>> [... CPU state, stack and call trace snipped ...]
>>> Since this suggests a timer was found on the list without ever having been
>>> initialized, I've spotted a case where this indeed could now happen. Could
>>> you give the patch below a try?
>>>
>>> Jan
>>>
>>> x86/vpt: fully init timers before putting onto list
>>>
>>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>>> iterating the list and acting on the timers of the list entries will no
>>> longer be kept from entering their loops by create_periodic_time()'s
>>> holding of that lock. Therefore at least init_timer() needs calling
>>> ahead of list insertion, but keep this and set_timer() together.
>>>
>>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Thanks for looking into this so quickly, and sorry for not realizing
>> this myself when relaxing the locking. Adding the timer to the list
>> without it being fully initialized was a latent issue, even if it was
>> initially protected by the lock.
>>
>> Provided testing shows the issue is fixed:
> I guess the change here is needed anyway, even if testing finds there's
> still something amiss?

We've put this patch in for testing, but results will take a while,
because it only showed up in our weekend stress testing.

~Andrew



* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Boris Ostrovsky @ 2021-06-14 17:06 UTC
  To: Andrew Cooper, Jan Beulich, Roger Pau Monné
  Cc: Igor Druzhinin, xen-devel, stephen.s.brennan


On 6/14/21 12:10 PM, Andrew Cooper wrote:
> On 14/06/2021 17:01, Jan Beulich wrote:
>> On 14.06.2021 15:27, Roger Pau Monné wrote:
>>> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>>>> On 14.06.2021 13:15, Igor Druzhinin wrote:
>>>>> [...]
>>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>>> [... CPU state, stack and call trace snipped ...]
>>>> Since this suggests a timer was found on the list without ever having been
>>>> initialized, I've spotted a case where this indeed could now happen. Could
>>>> you give the patch below a try?
>>>>
>>>> Jan
>>>>
>>>> x86/vpt: fully init timers before putting onto list
>>>>
>>>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>>>> iterating the list and acting on the timers of the list entries will no
>>>> longer be kept from entering their loops by create_periodic_time()'s
>>>> holding of that lock. Therefore at least init_timer() needs calling
>>>> ahead of list insertion, but keep this and set_timer() together.
>>>>
>>>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>>>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Thanks for looking into this so quickly, and sorry for not realizing
>>> this myself when relaxing the locking. Adding the timer to the list
>>> without it being fully initialized was a latent issue, even if it was
>>> initially protected by the lock.
>>>
>>> Provided testing shows the issue is fixed:
>> I guess the change here is needed anyway, even if testing finds there's
>> still something amiss?


Yes, I think so. Thanks for finding this.



Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>


> We've put this patch in for testing, but results will take a while,
> because it only showed up in our weekend stress testing.
>
> ~Andrew



* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Roger Pau Monné @ 2021-06-15  8:12 UTC
  To: Jan Beulich; +Cc: Igor Druzhinin, xen-devel, boris.ostrovsky, stephen.s.brennan

On Mon, Jun 14, 2021 at 06:01:17PM +0200, Jan Beulich wrote:
> On 14.06.2021 15:27, Roger Pau Monné wrote:
> > On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
> >> On 14.06.2021 13:15, Igor Druzhinin wrote:
> >>> [...]
> >>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> >>> [... CPU state, stack and call trace snipped ...]
> >>
> >> Since this suggests a timer was found on the list without ever having been
> >> initialized, I've spotted a case where this indeed could now happen. Could
> >> you give the patch below a try?
> >>
> >> Jan
> >>
> >> x86/vpt: fully init timers before putting onto list
> >>
> >> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
> >> iterating the list and acting on the timers of the list entries will no
> >> longer be kept from entering their loops by create_periodic_time()'s
> >> holding of that lock. Therefore at least init_timer() needs calling
> >> ahead of list insertion, but keep this and set_timer() together.
> >>
> >> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
> >> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Thanks for looking into this so quickly, and sorry for not realizing
> > this myself when relaxing the locking. Adding the timer to the list
> > without it being fully initialized was a latent issue, even if it was
> > initially protected by the lock.
> > 
> > Provided testing shows the issue is fixed:
> 
> I guess the change here is needed anyway, even if testing finds there's
> still something amiss?

Indeed, just wondered whether there might be other instances using a
similar pattern, but I'm not able to spot any.

It might even be better to fix other issues (if any) on a different
commit.

Thanks.



* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Andrew Cooper @ 2021-06-15  9:17 UTC
  To: Roger Pau Monné, Jan Beulich
  Cc: Igor Druzhinin, xen-devel, boris.ostrovsky, stephen.s.brennan

On 15/06/2021 09:12, Roger Pau Monné wrote:
> On Mon, Jun 14, 2021 at 06:01:17PM +0200, Jan Beulich wrote:
>> On 14.06.2021 15:27, Roger Pau Monné wrote:
>>> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>>>> On 14.06.2021 13:15, Igor Druzhinin wrote:
>>>>> [...]
>>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>>> [... CPU state, stack and call trace snipped ...]
>>>> Since this suggests a timer was found on the list without ever having been
>>>> initialized, I've spotted a case where this indeed could now happen. Could
>>>> you give the patch below a try?
>>>>
>>>> Jan
>>>>
>>>> x86/vpt: fully init timers before putting onto list
>>>>
>>>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>>>> iterating the list and acting on the timers of the list entries will no
>>>> longer be kept from entering their loops by create_periodic_time()'s
>>>> holding of that lock. Therefore at least init_timer() needs calling
>>>> ahead of list insertion, but keep this and set_timer() together.
>>>>
>>>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>>>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Thanks for looking into this so quickly, and sorry for not realizing
>>> this myself when relaxing the locking. Adding the timer to the list
>>> without it being fully initialized was a latent issue, even if it was
>>> initially protected by the lock.
>>>
>>> Provided testing shows the issue is fixed:
>> I guess the change here is needed anyway, even if testing finds there's
>> still something amiss?
> Indeed, just wondered whether there might be other instances using a
> similar pattern, but I'm not able to spot any.
>
> It might even be better to fix other issues (if any) on a different
> commit.

To be honest, this change is clearly good, and necessary.  I'd be
tempted to commit it now, as is, irrespective of whether there are
further bugs in this area.

~Andrew



* Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
From: Jan Beulich @ 2021-06-15 13:17 UTC
  To: Andrew Cooper
  Cc: Igor Druzhinin, xen-devel, boris.ostrovsky, stephen.s.brennan,
	Roger Pau Monné

On 15.06.2021 11:17, Andrew Cooper wrote:
> On 15/06/2021 09:12, Roger Pau Monné wrote:
>> On Mon, Jun 14, 2021 at 06:01:17PM +0200, Jan Beulich wrote:
>>> On 14.06.2021 15:27, Roger Pau Monné wrote:
>>>> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>>>>> x86/vpt: fully init timers before putting onto list
>>>>>
>>>>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>>>>> iterating the list and acting on the timers of the list entries will no
>>>>> longer be kept from entering their loops by create_periodic_time()'s
>>>>> holding of that lock. Therefore at least init_timer() needs calling
>>>>> ahead of list insertion, but keep this and set_timer() together.
>>>>>
>>>>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>>>>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> Thanks for looking into this so quickly, and sorry for not realizing
>>>> this myself when relaxing the locking. Adding the timer to the list
>>>> without it being fully initialized was a latent issue, even if it was
>>>> initially protected by the lock.
>>>>
>>>> Provided testing shows the issue is fixed:
>>> I guess the change here is needed anyway, even if testing finds there's
>>> still something amiss?
>> Indeed, just wondered whether there might be other instances using a
>> similar pattern, but I'm not able to spot any.
>>
>> It might even be better to fix other issues (if any) on a different
>> commit.
> 
> To be honest, this change is clearly good, and necessary.  I'd be
> tempted to commit it now, as is, irrespective of whether there are
> further bugs in this area.

Done.

Jan



