From: Jan Beulich <jbeulich@suse.com>
To: "Jürgen Groß" <jgross@suse.com>
Cc: Sergey Dyasli <sergey.dyasli@citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	George Dunlap <George.Dunlap@citrix.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	Ross Lagerwall <ross.lagerwall@citrix.com>,
	Xen-devel <xen-devel@lists.xen.org>
Subject: Re: [Xen-devel] Live-Patch application failure in core-scheduling mode
Date: Fri, 7 Feb 2020 09:49:59 +0100
Message-ID: <bfb81466-4cf8-c57f-b7cb-e07d1fc58351@suse.com>
In-Reply-To: <f7814499-920b-6d7f-1a39-bb4bfb4d69c6@suse.com>

On 07.02.2020 09:42, Jürgen Groß wrote:
> On 07.02.20 09:23, Jan Beulich wrote:
>> On 07.02.2020 09:04, Jürgen Groß wrote:
>>> On 06.02.20 15:02, Sergey Dyasli wrote:
>>>> On 06/02/2020 11:05, Sergey Dyasli wrote:
>>>>> On 06/02/2020 09:57, Jürgen Groß wrote:
>>>>>> On 05.02.20 17:03, Sergey Dyasli wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm currently investigating a Live-Patch application failure in core-
>>>>>>> scheduling mode. This is an example of what I usually get
>>>>>>> (it's easily reproducible):
>>>>>>>
>>>>>>>        (XEN) [  342.528305] livepatch: lp: CPU8 - IPIing the other 15 CPUs
>>>>>>>        (XEN) [  342.558340] livepatch: lp: Timed out on semaphore in CPU quiesce phase 13/15
>>>>>>>        (XEN) [  342.558343] bad cpus: 6 9
>>>>>>>
>>>>>>>        (XEN) [  342.559293] CPU:    6
>>>>>>>        (XEN) [  342.559562] Xen call trace:
>>>>>>>        (XEN) [  342.559565]    [<ffff82d08023f304>] R common/schedule.c#sched_wait_rendezvous_in+0xa4/0x270
>>>>>>>        (XEN) [  342.559568]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>>>>>        (XEN) [  342.559571]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>>>>>        (XEN) [  342.559574]    [<ffff82d080278ec5>] F arch/x86/domain.c#guest_idle_loop+0x35/0x60
>>>>>>>
>>>>>>>        (XEN) [  342.559761] CPU:    9
>>>>>>>        (XEN) [  342.560026] Xen call trace:
>>>>>>>        (XEN) [  342.560029]    [<ffff82d080241661>] R _spin_lock_irq+0x11/0x40
>>>>>>>        (XEN) [  342.560032]    [<ffff82d08023f323>] F common/schedule.c#sched_wait_rendezvous_in+0xc3/0x270
>>>>>>>        (XEN) [  342.560036]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>>>>>        (XEN) [  342.560039]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>>>>>        (XEN) [  342.560042]    [<ffff82d080279db5>] F arch/x86/domain.c#idle_loop+0x55/0xb0
>>>>>>>
>>>>>>> The first HT sibling is waiting for the second in the LP-application
>>>>>>> context while the second waits for the first in the scheduler context.
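
(To make the cross-wait concrete, a rough pseudo-C illustration -- the
counter and function names below are made up for illustration only, they
are not actual Xen code:)

    /* CPU A: live-patch master, waiting for all other CPUs to arrive
     * in the patch-application rendezvous. */
    while ( atomic_read(&lp_cpus_arrived) != nr_other_cpus )
        cpu_relax();    /* never satisfied: CPU B is stuck below */

    /* CPU B: core-scheduling rendezvous in schedule(), waiting for its
     * HT sibling (CPU A) to enter the scheduler as well. */
    while ( atomic_read(&sched_rendezvous_cnt) != sched_granularity )
        cpu_relax();    /* never satisfied: CPU A is stuck above */
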
>>>>>>>
>>>>>>> Any suggestions on how to improve this situation are welcome.
>>>>>>
>>>>>> Can you test the attached patch, please? It is only tested to boot, so
>>>>>> I did no livepatch tests with it.
>>>>>
>>>>> Thank you for the patch! It seems to fix the issue in my manual testing.
>>>>> I'm going to submit automatic LP testing for both thread/core modes.
>>>>
>>>> Andrew suggested testing late ucode loading as well, and so I did.
>>>> It uses stop_machine() to rendezvous the CPUs, and it failed with a
>>>> similar backtrace on a problematic CPU. But in this case the system
>>>> crashed, since no timeout is involved:
>>>>
>>>>       (XEN) [  155.025168] Xen call trace:
>>>>       (XEN) [  155.040095]    [<ffff82d0802417f2>] R _spin_unlock_irq+0x22/0x30
>>>>       (XEN) [  155.069549]    [<ffff82d08023f3c2>] S common/schedule.c#sched_wait_rendezvous_in+0xa2/0x270
>>>>       (XEN) [  155.109696]    [<ffff82d08023f728>] F common/schedule.c#sched_slave+0x198/0x260
>>>>       (XEN) [  155.145521]    [<ffff82d080240e1a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>>       (XEN) [  155.180223]    [<ffff82d0803716f6>] F x86_64/entry.S#process_softirqs+0x6/0x20
>>>>
>>>> It looks like your patch provides a workaround for the LP case, but
>>>> other cases like stop_machine() remain broken, since the underlying
>>>> issue with the scheduler is still there.
>>>
>>> And here is the fix for ucode loading (that was in fact the only case
>>> where stop_machine_run() wasn't already called in a tasklet).
>>
>> This is a rather odd restriction, and hence will need explaining.
> 
> stop_machine_run() uses a tasklet on each online cpu (excluding the
> one it was called on) to rendezvous all cpus. As tasklets are always
> executed on idle vcpus, stop_machine_run() must itself be called on an
> idle vcpu as well when core scheduling is active, as otherwise a
> deadlock will occur. This is accomplished by the use of
> continue_hypercall_on_cpu().
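
(Concretely, the shape being described would be roughly the following --
the helper names and the "buffer" argument are hypothetical, only
stop_machine_run() and continue_hypercall_on_cpu() are the existing
interfaces:)

    /* Hypothetical sketch: defer the CPU rendezvous to idle-vcpu
     * (tasklet) context instead of calling stop_machine_run() directly
     * from the hypercall handler. */
    static long ucode_update_cont(void *data)
    {
        /* Runs in a tasklet on an idle vcpu, so the per-CPU rendezvous
         * tasklets scheduled by stop_machine_run() can all be run. */
        return stop_machine_run(do_ucode_update, data, smp_processor_id());
    }

    /* In the hypercall handler, instead of the direct call: */
    return continue_hypercall_on_cpu(smp_processor_id(), ucode_update_cont,
                                     buffer);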

Well, it's this "a deadlock" which is too vague for me. What exactly is
it that deadlocks, and where (if not obvious from the description of
that case) is the connection to core scheduling? Fundamentally, such an
issue would seem to call for an adjustment to the core scheduling
logic, not for placing new restrictions on other pre-existing code.

Jan
