xen-devel.lists.xenproject.org archive mirror
From: "Jürgen Groß" <jgross@suse.com>
To: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Cc: Juergen Gross <jgross@suse.de>,
	Dario Faggioli <dfaggioli@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13
Date: Tue, 29 Sep 2020 17:27:48 +0200
Message-ID: <ea53b845-5edf-a61e-62ae-7ababc30b3e0@suse.com>
In-Reply-To: <20200929151627.GE1482@mail-itl>

On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
> On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
>> On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
>>> On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
>>>> On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
>>>>> [Adding Juergen]
>>>>>
>>>>> On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
>>>>>> On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
>>>>>>> On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> In my test setup (inside KVM with nested virt enabled), I rather
>>>>>>>> frequently get a Xen crash on resume from S3. Full message below.
>>>>>>>>
>>>>>>>> This is Xen 4.13.0, with some patches, including "sched: fix
>>>>>>>> resuming from S3 with smt=0".
>>>>>>>>
>>>>>>>> Contrary to the previous issue, this one does not always happen -
>>>>>>>> I would say in about 40% of cases on this setup, but very rarely
>>>>>>>> on a physical setup.
>>>>>>>>
>>>>>>>> This is _without_ core scheduling enabled, and also with smt=off.
>>>>>>>>
>>>>>>>> Do you think it would be any different on xen-unstable? I can
>>>>>>>> try, but it isn't trivial in this setup, so I'd ask first.
>>>>>>>>
>>>>> Well, Juergen has fixed quite a few issues.
>>>>>
>>>>> Most of them were triggering with core-scheduling enabled, and I don't
>>>>> recall any of them which looked similar or related to this.
>>>>>
>>>>> Still, it's possible that the same issue causes different symptoms, and
>>>>> hence that maybe one of the patches would fix this too.
>>>>
>>>> I've tested on master (d094e95fb7c), and reproduced exactly the same crash
>>>> (pasted below for completeness).
>>>> But there is more: additionally, in most (all?) cases after resume I've got
>>>> a soft lockup in Linux dom0 in smp_call_function_single() - see below. It
>>>> didn't happen before and the only change was Xen 4.13 -> master.
>>>>
>>>> Xen crash:
>>>>
>>>> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
>>>
>>> Juergen, any idea about this one? This is also happening on the current
>>> stable-4.14 (28855ebcdbfa).
>>>
>>
>> Oh, sorry I didn't come back to this issue.
>>
>> I suspect this is related to stop_machine_run() being called during
>> suspend(), as I'm seeing very sporadic issues when offlining and then
>> onlining cpus with core scheduling being active (it seems as if the
>> dom0 vcpu doing the cpu online activity sometimes is using an old
>> vcpu state).
> 
> Note this is default Xen 4.14 start, so core scheduling is _not_ active:

The similarity in the two failure cases is that multiple cpus are
affected by the operations during stop_machine_run().

> 
>      (XEN) Brought up 2 CPUs
>      (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
>      (XEN) Adding cpu 0 to runqueue 0
>      (XEN)  First cpu on runqueue, activating
>      (XEN) Adding cpu 1 to runqueue 1
>      (XEN)  First cpu on runqueue, activating
> 
>> I wasn't able to catch the real problem despite having tried lots
>> of approaches using debug patches.
>>
>> Recently I suspected the whole problem could be somehow related to
>> RCU handling, as stop_machine_run() is relying on tasklets which are
>> executing in idle context, and RCU handling is done in idle context,
>> too. So there might be some kind of use after free scenario in case
>> some memory is freed via RCU despite it still being used by a tasklet.
> 
> That sounds plausible, even though I don't really know this area of Xen.
> 
>> I "just" need to find some time to verify this suspicion. Any help doing
>> this would be appreciated. :-)
> 
> I do have a setup where I can easily-ish reproduce the issue. If there
> is some debug patch you'd like me to try, I can do that.

Thanks. I might come back to that offer as you are seeing a crash which
will be much easier to analyze. Catching my error case is much harder as
it surfaces some time after the real problem in a non-destructive way
(usually I'm seeing a failure to load a library in a program which has
just done its job using exactly the library that is claimed not to be
loadable).


Juergen

