xen-devel.lists.xenproject.org archive mirror
From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: "Jürgen Groß" <jgross@suse.com>
Cc: Juergen Gross <jgross@suse.de>,
	Dario Faggioli <dfaggioli@suse.com>,
	Jan Beulich <jbeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer
Date: Sat, 9 Oct 2021 18:28:17 +0200
Message-ID: <YWHDIQC3K8J3LD8+@mail-itl>
In-Reply-To: <20210131021526.GB6354@mail-itl>

On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Sep 29, 2020 at 05:27:48PM +0200, Jürgen Groß wrote:
> > On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
> > > On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
> > > > On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
> > > > > On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
> > > > > > > [Adding Juergen]
> > > > > > > 
> > > > > > > On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
> > > > > > > > On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
> > > > > > > > > On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > > In my test setup (inside KVM with nested virt enabled), I rather
> > > > > > > > > > > frequently get a Xen crash on resume from S3. Full message below.
> > > > > > > > > > 
> > > > > > > > > > > This is Xen 4.13.0, with some patches, including "sched: fix
> > > > > > > > > > > resuming from S3 with smt=0".
> > > > > > > > > > 
> > > > > > > > > > > Contrary to the previous issue, this one does not always happen -
> > > > > > > > > > > I would say in about 40% of cases on this setup, but very rarely
> > > > > > > > > > > on a physical setup.
> > > > > > > > > > 
> > > > > > > > > > This is _without_ core scheduling enabled, and also with smt=off.
> > > > > > > > > > 
> > > > > > > > > > > Do you think it would be any different on xen-unstable? I can
> > > > > > > > > > > try, but it isn't trivial in this setup, so I'd ask first.
> > > > > > > > > > 
> > > > > > > Well, Juergen has fixed quite a few issues.
> > > > > > > 
> > > > > > > Most of them were triggering with core-scheduling enabled, and I don't
> > > > > > > recall any of them which looked similar or related to this.
> > > > > > > 
> > > > > > > Still, it's possible that the same issue causes different symptoms, and
> > > > > > > hence that maybe one of the patches would fix this too.
> > > > > > 
> > > > > > I've tested on master (d094e95fb7c), and reproduced exactly the same crash
> > > > > > (pasted below for completeness).
> > > > > > But there is more: additionally, in most (all?) cases after resume I've got
> > > > > > a soft lockup in Linux dom0 in smp_call_function_single() - see below. It
> > > > > > didn't happen before and the only change was Xen 4.13 -> master.
> > > > > > 
> > > > > > Xen crash:
> > > > > > 
> > > > > > (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
> > > > > 
> > > > > Juergen, any idea about this one? This is also happening on the current
> > > > > stable-4.14 (28855ebcdbfa).
> > > > > 
> > > > 
> > > > Oh, sorry I didn't come back to this issue.
> > > > 
> > > > I suspect this is related to stop_machine_run() being called during
> > > > suspend(), as I'm seeing very sporadic issues when offlining and then
> > > > onlining cpus with core scheduling being active (it seems as if the
> > > > dom0 vcpu doing the cpu online activity sometimes is using an old
> > > > vcpu state).
> > > 
> > > Note this is default Xen 4.14 start, so core scheduling is _not_ active:
> > 
> > The similarity in the two failure cases is that multiple cpus are
> > affected by the operations during stop_machine_run().
> > 
> > > 
> > >      (XEN) Brought up 2 CPUs
> > >      (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
> > >      (XEN) Adding cpu 0 to runqueue 0
> > >      (XEN)  First cpu on runqueue, activating
> > >      (XEN) Adding cpu 1 to runqueue 1
> > >      (XEN)  First cpu on runqueue, activating
> > > 
> > > > I wasn't able to catch the real problem despite having tried lots
> > > > of approaches using debug patches.
> > > > 
> > > > Recently I suspected the whole problem could be somehow related to
> > > > RCU handling, as stop_machine_run() is relying on tasklets which are
> > > > executing in idle context, and RCU handling is done in idle context,
> > > > too. So there might be some kind of use after free scenario in case
> > > > some memory is freed via RCU despite it still being used by a tasklet.
> > > 
> > > That sounds plausible, even though I don't really know this area of Xen.
> > > 
> > > > I "just" need to find some time to verify this suspicion. Any help doing
> > > > this would be appreciated. :-)
> > > 
> > > I do have a setup where I can easily-ish reproduce the issue. If there
> > > is some debug patch you'd like me to try, I can do that.
> > 
> > Thanks. I might come back to that offer as you are seeing a crash which
> > will be much easier to analyze. Catching my error case is much harder as
> > it surfaces some time after the real problem in a non-destructive way
> > (usually I'm seeing a failure to load a library in a program which has
> > just done its job using exactly the library reported as not loadable).
> 
> Hi,
> 
> I'm resurrecting this thread as it was recently mentioned elsewhere. I
> can still reproduce the issue on the recent staging branch (9dc687f155).
> 
> It fails after the first resume (not always, but frequently enough to
> debug it). At least one guest needs to be running - with just (PV) dom0
> the crash doesn't happen (at least for the ~8 times in a row I tried).
> If the first resume works, the second (almost?) always fails, but
> with different symptoms - dom0 kernel lockups (at least some of its
> vcpus). I haven't debugged this one at all yet.
> 
> Any help will be appreciated, I can apply some debug patches, change
> configuration etc.

This still happens on 4.14.3. Maybe it is related to freeing percpu
areas, as it caused other issues with suspend too? Just a thought...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Thread overview: 26+ messages
2020-03-18 14:16 [Xen-devel] Xen crash after S3 suspend - Xen 4.13 Marek Marczykowski-Górecki
2020-03-18 14:50 ` Andrew Cooper
2020-03-18 22:10   ` Marek Marczykowski-Górecki
2020-03-19  0:28     ` Dario Faggioli
2020-03-19  0:59       ` Marek Marczykowski-Górecki
2020-03-23  0:09       ` Marek Marczykowski-Górecki
2020-03-23  8:14         ` Jan Beulich
2020-09-29 14:27         ` Marek Marczykowski-Górecki
2020-09-29 15:07           ` Jürgen Groß
2020-09-29 15:16             ` Marek Marczykowski-Górecki
2020-09-29 15:27               ` Jürgen Groß
2021-01-31  2:15                 ` [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer Marek Marczykowski-Górecki
2021-10-09 16:28                   ` Marek Marczykowski-Górecki [this message]
2022-08-21 16:14                     ` Marek Marczykowski-Górecki
2022-08-22  9:53                       ` Jan Beulich
2022-08-22 10:00                         ` Marek Marczykowski-Górecki
2022-09-20 10:22                           ` Marek Marczykowski-Górecki
2022-09-20 14:30                             ` Jan Beulich
2022-10-11 11:22                               ` Marek Marczykowski-Górecki
2022-10-14 16:42                             ` George Dunlap
2022-10-21  6:41                             ` Juergen Gross
2022-08-22 15:34                       ` Juergen Gross
2022-09-06 11:46                         ` Juergen Gross
2022-09-06 12:35                           ` Marek Marczykowski-Górecki
2022-09-07 12:21                             ` Dario Faggioli
2022-09-07 15:07                               ` marmarek