From: Anchal Agarwal <anchalag@amazon.com>
To: <boris.ostrovsky@oracle.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"bp@alien8.de" <bp@alien8.de>, "hpa@zytor.com" <hpa@zytor.com>,
	"jgross@suse.com" <jgross@suse.com>,
	"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"sstabellini@kernel.org" <sstabellini@kernel.org>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"roger.pau@citrix.com" <roger.pau@citrix.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"len.brown@intel.com" <len.brown@intel.com>,
	"pavel@ucw.cz" <pavel@ucw.cz>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"vkuznets@redhat.com" <vkuznets@redhat.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Woodhouse, David" <dwmw@amazon.co.uk>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
	<anchalag@amazon.com>, <aams@amazon.com>
Subject: Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode
Date: Fri, 21 May 2021 05:26:50 +0000
Message-ID: <20210521052650.GA19056@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
In-Reply-To: <8cd59d9c-36b1-21cf-e59f-40c5c20c65f8@oracle.com>

On Thu, Oct 01, 2020 at 08:43:58AM -0400, boris.ostrovsky@oracle.com wrote:
> 
> >>>>>>> Also, wrt KASLR stuff, that issue is still seen sometimes but I haven't had
> >>>>>>> bandwidth to dive deep into the issue and fix it.
> >>>> So what's the plan there? You first mentioned this issue early this year and judged by your response it is not clear whether you will ever spend time looking at it.
> >>>>
> >>> I do want to fix it and did do some debugging earlier this year just haven't
> >>> gotten back to it. Also, wanted to understand if the issue is a blocker to this
> >>> series?
> >>
> >> Integrating code with known bugs is less than ideal.
> >>
> > So for this series to be accepted, KASLR needs to be fixed along with other
> > comments of course?
> 
> 
> Yes, please.
> 
> 
> 
> >>> I had some theories when debugging around this like if the random base address picked by kaslr for the
> >>> resuming kernel mismatches the suspended kernel and just jogging my memory, I didn't find that as the case.
> >>> Another hunch was if physical address of registered vcpu info at boot is different from what suspended kernel
> >>> has and that can cause CPU's to get stuck when coming online.
> >>
> >> I'd think if this were the case you'd have 100% failure rate. And we are also re-registering vcpu info on xen restore and I am not aware of any failures due to KASLR.
> >>
> > What I meant there wrt VCPU info was that VCPU info is not unregistered during hibernation,
> > so Xen still remembers the old physical addresses for the VCPU information, created by the
> > booting kernel. But since the hibernation kernel may have different physical
> > addresses for VCPU info and if mismatch happens, it may cause issues with resume.
> > During hibernation, the VCPU info register hypercall is not invoked again.
> 
> 
> I still don't think that's the cause but it's certainly worth having a look.
> 
Hi Boris,
Apologies for picking this up again after so long.
I dug deeper into the above statement, and that is indeed what is happening.
I did some debugging around KASLR and hibernation using reboot mode.
My debug prints show that whenever the vcpu_info* address assigned to a secondary vcpu
in xen_vcpu_setup() at boot differs from the one in the image, resume gets stuck for that vcpu
in bringup_cpu(). That means &per_cpu(xen_vcpu_info, cpu) has a different address at boot than it has
after control jumps into the image.

I could not get any prints after it got stuck in bringup_cpu(), and
I have no way to send a sysrq signal to the guest or to get a kdump.
The address change is not observed in every hibernate-resume cycle. I am not sure whether this is a bug or
expected behavior.
I am also considering whether this may be a bug in the Xen code that is triggered only when
KASLR is enabled, but I do not have substantial data to prove that.
Is it a coincidence that this always happens for the 1st vcpu?
Moreover, since the hypervisor is not aware that the guest is hibernated, and in reboot mode it looks like
a regular shutdown to dom0, would re-registering vcpu_info for the secondary vcpus even be plausible?
I could definitely use some advice on debugging this further.

 
Some printk's from my debugging:

At Boot:

xen_vcpu_setup: xen_have_vcpu_info_placement=1 cpu=1, vcpup=0xffff9e548fa560e0, info.mfn=3996246 info.offset=224,

After the image loads:
It ends up in this condition:
 xen_vcpu_setup()
 {
         ...
         if (xen_hvm_domain()) {
                 if (per_cpu(xen_vcpu, cpu) == &per_cpu(xen_vcpu_info, cpu))
                         return 0;
         }
         ...
 }

xen_vcpu_setup: checking mfn on resume cpu=1, info.mfn=3934806 info.offset=224, &per_cpu(xen_vcpu_info, cpu)=0xffff9d7240a560e0

This was tested on a c4.2xlarge instance [8 vcpus, 15GB mem] with a 5.10 kernel
running in the guest.
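
To make the comparison between the two prints explicit, here is a small
user-space sketch, not kernel code. The struct and helper names are
hypothetical and exist only for illustration, and 4 KiB pages are assumed.
It checks that info.offset matches the low 12 bits of the printed address
and that the boot and resume prints disagree only in the frame number.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* One xen_vcpu_setup debug print, reduced to the fields compared here.
 * Hypothetical type, used only for this illustration. */
struct vcpu_info_print {
        uint64_t vaddr;   /* printed address of the per-cpu xen_vcpu_info */
        uint64_t mfn;     /* info.mfn from the print */
        uint32_t offset;  /* info.offset from the print */
};

/* Values copied from the two prints above (cpu=1). */
static const struct vcpu_info_print boot_print = {
        0xffff9e548fa560e0ULL, 3996246, 224
};
static const struct vcpu_info_print resume_print = {
        0xffff9d7240a560e0ULL, 3934806, 224
};

/* Offset of a virtual address within its page. */
static uint32_t page_offset(uint64_t vaddr)
{
        return (uint32_t)(vaddr & ((1ULL << PAGE_SHIFT) - 1));
}

/* Nonzero when both prints describe the same frame and in-page offset. */
static int same_registration(const struct vcpu_info_print *a,
                             const struct vcpu_info_print *b)
{
        return a->mfn == b->mfn && a->offset == b->offset;
}
```

With the values above, page_offset() gives 224 (0xe0) for both addresses,
which agrees with info.offset in both prints, while same_registration()
returns 0 because the two mfn values differ. So the in-page placement is
the same and only the frame has moved between boot and the image.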

Thanks,
Anchal.
> 
> -boris
> 
> 

