From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: "Andrew Cooper" <andrew.cooper3@citrix.com>,
"Jan Beulich" <jbeulich@suse.com>,
"Roger Pau Monné" <roger.pau@citrix.com>
Cc: Igor Druzhinin <igor.druzhinin@citrix.com>,
xen-devel@lists.xen.org, stephen.s.brennan@oracle.com
Subject: Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
Date: Mon, 14 Jun 2021 13:06:12 -0400 [thread overview]
Message-ID: <45cd516b-326f-a0e6-325a-fc0debd48571@oracle.com> (raw)
In-Reply-To: <a7ec2d98-465a-9691-ab73-bef5b45a6cfd@citrix.com>

On 6/14/21 12:10 PM, Andrew Cooper wrote:
> On 14/06/2021 17:01, Jan Beulich wrote:
>> On 14.06.2021 15:27, Roger Pau Monné wrote:
>>> On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
>>>> On 14.06.2021 13:15, Igor Druzhinin wrote:
>>>>> Hi, Boris, Stephen, Roger,
>>>>>
>>>>> We have stress-tested recent changes on staging-4.13, which includes a
>>>>> backport of the subject commit. Since the backport is identical to the
>>>>> master branch and all of the prerequisites are in place, we have no
>>>>> reason to believe master is unaffected.
>>>>>
>>>>> Here is what we got by running heavy stress testing including multiple
>>>>> repeated VM lifecycle operations with storage and network load:
>>>>>
>>>>>
>>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>>> ----[ Xen-4.13.3-10.7-d x86_64 debug=y Not tainted ]----
>>>>> CPU: 17
>>>>> RIP: e008:[<ffff82d080246b65>] common/timer.c#active_timer+0xc/0x1b
>>>>> RFLAGS: 0000000000010046 CONTEXT: hypervisor (d675v0)
>>>>> rax: 0000000000000000 rbx: ffff83137a8ed300 rcx: 0000000000000000
>>>>> rdx: ffff83303fff7fff rsi: ffff83303fff2549 rdi: ffff83137a8ed300
>>>>> rbp: ffff83303fff7cf8 rsp: ffff83303fff7cf8 r8: 0000000000000001
>>>>> r9: 0000000000000000 r10: 0000000000000011 r11: 0000168b0cc08083
>>>>> r12: 0000000000000000 r13: ffff82d0805cf300 r14: ffff82d0805cf300
>>>>> r15: 0000000000000292 cr0: 0000000080050033 cr4: 00000000000426e0
>>>>> cr3: 00000013c1a32000 cr2: 0000000000000000
>>>>> fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000
>>>>> ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
>>>>> Xen code around <ffff82d080246b65> (common/timer.c#active_timer+0xc/0x1b):
>>>>> 0f b6 47 2a 84 c0 75 02 <0f> 0b 3c 04 76 02 0f 0b 3c 02 0f 97 c0 5d c3 55
>>>>> Xen stack trace from rsp=ffff83303fff7cf8:
>>>>> ffff83303fff7d48 ffff82d0802479f1 0000168b0192b846 ffff83137a8ed328
>>>>> 000000001d0776eb ffff83137a8ed2c0 ffff83133ee47568 ffff83133ee47000
>>>>> ffff83133ee47560 ffff832b1a0cd000 ffff83303fff7d78 ffff82d08031e74e
>>>>> ffff83102d898000 ffff83133ee47000 ffff83102db8d000 0000000000000011
>>>>> ffff83303fff7dc8 ffff82d08027df19 0000000000000000 ffff83133ee47060
>>>>> ffff82d0805d0088 ffff83102d898000 ffff83133ee47000 0000000000000011
>>>>> 0000000000000001 0000000000000011 ffff83303fff7e08 ffff82d0802414e0
>>>>> ffff83303fff7df8 0000168b0192b846 ffff83102d8a4660 0000168b0192b846
>>>>> ffff83102d8a4720 0000000000000011 ffff83303fff7ea8 ffff82d080241d6c
>>>>> ffff83133ee47000 ffff831244137a50 ffff83303fff7e48 ffff82d08031b5b8
>>>>> ffff83133ee47000 ffff832b1a0cd000 ffff83303fff7e68 ffff82d080312b65
>>>>> ffff83133ee47000 0000000000000000 ffff83303fff7ee8 ffff83102d8a4678
>>>>> ffff83303fff7ee8 ffff82d0805d6380 ffff82d0805d5b00 ffffffffffffffff
>>>>> ffff83303fff7fff 0000000000000000 ffff83303fff7ed8 ffff82d0802431f5
>>>>> ffff83133ee47000 0000000000000000 0000000000000000 0000000000000000
>>>>> ffff83303fff7ee8 ffff82d08024324a 00007ccfc00080e7 ffff82d08033930b
>>>>> ffffffffb0ebd5a0 000000000000000d 0000000000000062 0000000000000097
>>>>> 000000000000001e 000000000000001f ffffffffb0ebd5ad 0000000000000000
>>>>> 0000000000000005 000000000003d91d 0000000000000000 0000000000000000
>>>>> 00000000000003d5 000000000000001e 0000000000000000 0000beef0000beef
>>>>> Xen call trace:
>>>>> [<ffff82d080246b65>] R common/timer.c#active_timer+0xc/0x1b
>>>>> [<ffff82d0802479f1>] F stop_timer+0xf5/0x188
>>>>> [<ffff82d08031e74e>] F pt_save_timer+0x45/0x8a
>>>>> [<ffff82d08027df19>] F context_switch+0xf9/0xee0
>>>>> [<ffff82d0802414e0>] F common/schedule.c#sched_context_switch+0x146/0x151
>>>>> [<ffff82d080241d6c>] F common/schedule.c#schedule+0x28a/0x299
>>>>> [<ffff82d0802431f5>] F common/softirq.c#__do_softirq+0x85/0x90
>>>>> [<ffff82d08024324a>] F do_softirq+0x13/0x15
>>>>> [<ffff82d08033930b>] F vmx_asm_do_vmentry+0x2b/0x30
>>>>>
>>>>> ****************************************
>>>>> Panic on CPU 17:
>>>>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
>>>>> ****************************************
>>>> Since this suggests a timer was found on the list without ever having been
>>>> initialized, I've spotted a case where this indeed could now happen. Could
>>>> you give the patch below a try?
>>>>
>>>> Jan
>>>>
>>>> x86/vpt: fully init timers before putting onto list
>>>>
>>>> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
>>>> iterating the list and acting on the timers of the list entries will no
>>>> longer be kept from entering their loops by create_periodic_time()'s
>>>> holding of that lock. Therefore at least init_timer() needs calling
>>>> ahead of list insertion, but keep this and set_timer() together.
>>>>
>>>> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
>>>> Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Thanks for looking into this so quickly, and sorry for not realizing
>>> myself when relaxing the locking. Adding the timer to the list without
>>> it being fully initialized was a latent issue even if protected by the
>>> lock initially.
>>>
>>> Provided testing shows the issue is fixed:
>> I guess the change here is needed anyway, even if testing finds there's
>> still something amiss?

Yes, I think so. Thanks for finding this.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

> We've put this patch in for testing, but results will take a while,
> because it only showed up in our weekend stress testing.
>
> ~Andrew
Thread overview: 9+ messages
2021-06-14 11:15 BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases") Igor Druzhinin
2021-06-14 11:53 ` Jan Beulich
2021-06-14 13:27 ` Roger Pau Monné
2021-06-14 16:01 ` Jan Beulich
2021-06-14 16:10 ` Andrew Cooper
2021-06-14 17:06 ` Boris Ostrovsky [this message]
2021-06-15 8:12 ` Roger Pau Monné
2021-06-15 9:17 ` Andrew Cooper
2021-06-15 13:17 ` Jan Beulich