Linux-PM Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 00/12] Fix PM hibernation in Xen guests
@ 2020-05-19 23:24 Anchal Agarwal
  2020-05-19 23:24 ` [PATCH 01/12] xen/manage: keep track of the on-going suspend mode Anchal Agarwal
                   ` (12 more replies)
  0 siblings, 13 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:24 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

Hello,
This series fixes PM hibernation for hvm guests running on xen hypervisor.
The running guest could now be hibernated and resumed successfully at a
later time. The fixes for PM hibernation are added to block and
network device drivers i.e xen-blkfront and xen-netfront. Any other driver
that needs to add S4 support if not already, can follow same method of
introducing freeze/thaw/restore callbacks.
The patches had been tested against upstream kernel and xen4.11. Large
scale testing is also done on Xen based Amazon EC2 instances. All this testing
involved running memory exhausting workload in the background.

Doing guest hibernation does not involve any support from hypervisor and
this way guest has complete control over its state. Infrastructure
restrictions for saving up guest state can be overcome by guest initiated
hibernation.

These patches were send out as RFC before and all the feedback had been
incorporated in the patches. The last RFCV3 could be found here:
https://lkml.org/lkml/2020/2/14/2789

Known issues:
1.KASLR causes intermittent hibernation failures. VM fails to resumes and
has to be restarted. I will investigate this issue separately and shouldn't
be a blocker for this patch series.
2. During hibernation, I observed sometimes that freezing of tasks fails due
to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
may work. Also, this is a known issue with hibernation and some
filesystems like XFS has been discussed by the community for years with not an
effectve resolution at this point.

Testing How to:
---------------
1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
xen-4.11]
2. Bring up a HVM guest w/t kernel compiled with hibernation patches
[I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
3. Create a swap file size=RAM size
4. Update grub parameters and reboot
5. Trigger pm-hibernation from within the VM

Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap

Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1

Execute:
--------
sudo pm-hibernate
OR
echo disk > /sys/power/state && echo reboot > /sys/power/disk

Compute resume offset code:
"
#!/usr/bin/env python
import sys
import array
import fcntl

#swap file
f = open(sys.argv[1], 'r')
buf = array.array('L', [0])

#FIBMAP
ret = fcntl.ioctl(f.fileno(), 0x01, buf)
print buf[0]
"


Anchal Agarwal (5):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
    Resume
  genirq: Shutdown irq chips in suspend/resume during hibernation
  xen: Introduce wrapper for save/restore sched clock offset
  xen: Update sched clock offset to avoid system instability in
    hibernation
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Munehisa Kamata (7):
  xen/manage: keep track of the on-going suspend mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen-netfront: add callbacks for PM suspend and hibernation
  xen/time: introduce xen_{save,restore}_steal_clock
  x86/xen: save and restore steal clock

 arch/x86/xen/enlighten_hvm.c      |   8 ++
 arch/x86/xen/suspend.c            |  72 ++++++++++++++++++
 arch/x86/xen/time.c               |  18 ++++-
 arch/x86/xen/xen-ops.h            |   3 +
 drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++--
 drivers/net/xen-netfront.c        |  98 +++++++++++++++++++++++-
 drivers/xen/events/events_base.c  |   1 +
 drivers/xen/manage.c              |  73 ++++++++++++++++++
 drivers/xen/time.c                |  29 ++++++-
 drivers/xen/xenbus/xenbus_probe.c |  99 +++++++++++++++++++-----
 include/linux/irq.h               |   2 +
 include/xen/xen-ops.h             |   8 ++
 include/xen/xenbus.h              |   3 +
 kernel/irq/chip.c                 |   2 +-
 kernel/irq/internals.h            |   1 +
 kernel/irq/pm.c                   |  31 +++++---
 kernel/power/user.c               |   6 +-
 17 files changed, 536 insertions(+), 40 deletions(-)

-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 01/12] xen/manage: keep track of the on-going suspend mode
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
@ 2020-05-19 23:24 ` Anchal Agarwal
  2020-05-30 22:26   ` Boris Ostrovsky
  2020-05-19 23:25 ` [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support Anchal Agarwal
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:24 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

Guest hibernation is different from xen suspend/resume/live migration.
Xen save/restore does not use pm_ops as is needed by guest hibernation.
Hibernation in guest follows ACPI path and is guest inititated , the
hibernation image is saved within guest as compared to later modes
which are xen toolstack assisted and image creation/storage is in
control of hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode by mainly using a new PM notifier.
Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have corresponding PM event, its main logic
is modfied to acquire pm_mutex and set the current mode.

Though, acquirng pm_mutex is still right thing to do, we may
see deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on xenwatch thread to process xenbus state
transactions, but the thread will sleep to wait pm_mutex which is
already held by PM hibernation context in the scenario. Xen shutdown
code may need some changes to avoid the issue.

[Anchal Changelog: Code refactoring]
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
---
 drivers/xen/manage.c  | 73 +++++++++++++++++++++++++++++++++++++++++++
 include/xen/xen-ops.h |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..0b30ab522b77 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
 #include <linux/freezer.h>
 #include <linux/syscore_ops.h>
 #include <linux/export.h>
+#include <linux/suspend.h>
 
 #include <xen/xen.h>
 #include <xen/xenbus.h>
@@ -40,6 +41,31 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+	NO_SUSPEND = 0,
+	XEN_SUSPEND,
+	PM_SUSPEND,
+	PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+	return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+	return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+	return suspend_mode == PM_HIBERNATION;
+}
+
 struct suspend_info {
 	int cancelled;
 };
@@ -99,6 +125,10 @@ static void do_suspend(void)
 	int err;
 	struct suspend_info si;
 
+	lock_system_sleep();
+
+	suspend_mode = XEN_SUSPEND;
+
 	shutting_down = SHUTDOWN_SUSPEND;
 
 	err = freeze_processes();
@@ -162,6 +192,10 @@ static void do_suspend(void)
 	thaw_processes();
 out:
 	shutting_down = SHUTDOWN_INVALID;
+
+	suspend_mode = NO_SUSPEND;
+
+	unlock_system_sleep();
 }
 #endif	/* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -387,3 +421,42 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+			   unsigned long pm_event, void *unused)
+{
+	switch (pm_event) {
+	case PM_SUSPEND_PREPARE:
+		suspend_mode = PM_SUSPEND;
+		break;
+	case PM_HIBERNATION_PREPARE:
+	case PM_RESTORE_PREPARE:
+		suspend_mode = PM_HIBERNATION;
+		break;
+	case PM_POST_SUSPEND:
+	case PM_POST_RESTORE:
+	case PM_POST_HIBERNATION:
+		/* Set back to the default */
+		suspend_mode = NO_SUSPEND;
+		break;
+	default:
+		pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+		return -EINVAL;
+	}
+
+	return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+	.notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+	if (!xen_hvm_domain())
+		return -ENODEV;
+
+	return register_pm_notifier(&xen_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 095be1d66f31..4ffe031adfc7 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,9 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
  2020-05-19 23:24 ` [PATCH 01/12] xen/manage: keep track of the on-going suspend mode Anchal Agarwal
@ 2020-05-19 23:25 ` Anchal Agarwal
  2020-05-30 22:56   ` Boris Ostrovsky
  2020-05-19 23:25 ` [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume Anchal Agarwal
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:25 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.

Note that we'll use freeze/restore callbacks even for PM suspend whereas
suspend/resume callbacks are normally used in the case, becausae the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. So we can allow the device
drivers to keep the existing callbacks wihtout modification.

[Anchal Changelog: Refactored the callbacks code]
Signed-off-by: Agarwal Anchal <anchalag@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
---
 drivers/xen/xenbus/xenbus_probe.c | 99 +++++++++++++++++++++++++------
 include/xen/xenbus.h              |  3 +
 2 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 8c4d05b687b7..1589b9b2cb56 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -49,6 +49,7 @@
 #include <linux/io.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/suspend.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -599,27 +600,44 @@ int xenbus_dev_suspend(struct device *dev)
 	struct xenbus_driver *drv;
 	struct xenbus_device *xdev
 		= container_of(dev, struct xenbus_device, dev);
-
+	bool xen_suspend = xen_suspend_mode_is_xen_suspend();
 	DPRINTK("%s", xdev->nodename);
 
 	if (dev->driver == NULL)
 		return 0;
 	drv = to_xenbus_driver(dev->driver);
-	if (drv->suspend)
-		err = drv->suspend(xdev);
-	if (err)
-		pr_warn("suspend %s failed: %i\n", dev_name(dev), err);
+
+	if (xen_suspend) {
+		if (drv->suspend)
+			err = drv->suspend(xdev);
+	} else {
+		if (drv->freeze) {
+			err = drv->freeze(xdev);
+			if (!err) {
+				free_otherend_watch(xdev);
+				free_otherend_details(xdev);
+				return 0;
+			}
+		}
+	}
+
+	if (err) {
+		pr_warn("%s %s failed: %i\n", xen_suspend ?
+			"suspend" : "freeze", dev_name(dev), err);
+		return err;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
 
 int xenbus_dev_resume(struct device *dev)
 {
-	int err;
+	int err = 0;
 	struct xenbus_driver *drv;
 	struct xenbus_device *xdev
 		= container_of(dev, struct xenbus_device, dev);
-
+	bool xen_suspend = xen_suspend_mode_is_xen_suspend();
 	DPRINTK("%s", xdev->nodename);
 
 	if (dev->driver == NULL)
@@ -627,24 +645,32 @@ int xenbus_dev_resume(struct device *dev)
 	drv = to_xenbus_driver(dev->driver);
 	err = talk_to_otherend(xdev);
 	if (err) {
-		pr_warn("resume (talk_to_otherend) %s failed: %i\n",
+		pr_warn("%s (talk_to_otherend) %s failed: %i\n",
+			xen_suspend ? "resume" : "restore",
 			dev_name(dev), err);
 		return err;
 	}
 
-	xdev->state = XenbusStateInitialising;
+	if (xen_suspend) {
+		xdev->state = XenbusStateInitialising;
+		if (drv->resume)
+			err = drv->resume(xdev);
+	} else {
+		if (drv->restore)
+			err = drv->restore(xdev);
+	}
 
-	if (drv->resume) {
-		err = drv->resume(xdev);
-		if (err) {
-			pr_warn("resume %s failed: %i\n", dev_name(dev), err);
-			return err;
-		}
+	if (err) {
+		pr_warn("%s %s failed: %i\n",
+			xen_suspend ? "resume" : "restore",
+			dev_name(dev), err);
+		return err;
 	}
 
 	err = watch_otherend(xdev);
 	if (err) {
-		pr_warn("resume (watch_otherend) %s failed: %d.\n",
+		pr_warn("%s (watch_otherend) %s failed: %d.\n",
+			xen_suspend ? "resume" : "restore",
 			dev_name(dev), err);
 		return err;
 	}
@@ -655,8 +681,45 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
 
 int xenbus_dev_cancel(struct device *dev)
 {
-	/* Do nothing */
-	DPRINTK("cancel");
+	int err = 0;
+	struct xenbus_driver *drv;
+	struct xenbus_device *xdev
+		= container_of(dev, struct xenbus_device, dev);
+	bool xen_suspend = xen_suspend_mode_is_xen_suspend();
+
+	if (xen_suspend) {
+		/* Do nothing */
+		DPRINTK("cancel");
+		return 0;
+	}
+
+	DPRINTK("%s", xdev->nodename);
+
+	if (dev->driver == NULL)
+		return 0;
+	drv = to_xenbus_driver(dev->driver);
+	err = talk_to_otherend(xdev);
+	if (err) {
+		pr_warn("thaw (talk_to_otherend) %s failed: %d.\n",
+			dev_name(dev), err);
+		return err;
+	}
+
+	if (drv->thaw) {
+		err = drv->thaw(xdev);
+		if (err) {
+			pr_warn("thaw %s failed: %i\n", dev_name(dev), err);
+			return err;
+		}
+	}
+
+	err = watch_otherend(xdev);
+	if (err) {
+		pr_warn("thaw (watch_otherend) %s failed: %d.\n",
+			dev_name(dev), err);
+		return err;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_cancel);
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 5a8315e6d8a6..8da964763255 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -104,6 +104,9 @@ struct xenbus_driver {
 	int (*remove)(struct xenbus_device *dev);
 	int (*suspend)(struct xenbus_device *dev);
 	int (*resume)(struct xenbus_device *dev);
+	int (*freeze)(struct xenbus_device *dev);
+	int (*thaw)(struct xenbus_device *dev);
+	int (*restore)(struct xenbus_device *dev);
 	int (*uevent)(struct xenbus_device *, struct kobj_uevent_env *);
 	struct device_driver driver;
 	int (*read_otherend_details)(struct xenbus_device *dev);
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
  2020-05-19 23:24 ` [PATCH 01/12] xen/manage: keep track of the on-going suspend mode Anchal Agarwal
  2020-05-19 23:25 ` [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support Anchal Agarwal
@ 2020-05-19 23:25 ` Anchal Agarwal
  2020-05-30 23:02   ` Boris Ostrovsky
  2020-05-19 23:26 ` [PATCH 04/12] x86/xen: add system core suspend and resume callbacks Anchal Agarwal
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:25 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

Introduce a small function which re-uses shared page's PA allocated
during guest initialization time in reserve_shared_info() and not
allocate new page during resume flow.
It also  does the mapping of shared_info_page by calling
xen_hvm_init_shared_info() to use the function.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
---
 arch/x86/xen/enlighten_hvm.c | 7 +++++++
 arch/x86/xen/xen-ops.h       | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e138f7de52d2..75b1ec7a0fcd 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -27,6 +27,13 @@
 
 static unsigned long shared_info_pfn;
 
+void xen_hvm_map_shared_info(void)
+{
+	xen_hvm_init_shared_info();
+	if (shared_info_pfn)
+		HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
 void xen_hvm_init_shared_info(void)
 {
 	struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 45a441c33d6d..d84c357994bd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -56,6 +56,7 @@ void xen_enable_syscall(void);
 void xen_vcpu_restore(void);
 
 void xen_callback_vector(void);
+void xen_hvm_map_shared_info(void);
 void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 04/12] x86/xen: add system core suspend and resume callbacks
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (2 preceding siblings ...)
  2020-05-19 23:25 ` [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume Anchal Agarwal
@ 2020-05-19 23:26 ` Anchal Agarwal
  2020-05-30 23:10   ` Boris Ostrovsky
  2020-05-19 23:26 ` [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation Anchal Agarwal
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:26 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

Add Xen PVHVM specific system core callbacks for PM suspend and
hibernation support. The callbacks suspend and resume Xen
primitives,like shared_info, pvclock and grant table. Note that
Xen suspend can handle them in a different manner, but system
core callbacks are called from the context. So if the callbacks
are called from Xen suspend context, return immediately.

Signed-off-by: Agarwal Anchal <anchalag@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
---
 arch/x86/xen/enlighten_hvm.c |  1 +
 arch/x86/xen/suspend.c       | 53 ++++++++++++++++++++++++++++++++++++
 include/xen/xen-ops.h        |  3 ++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 75b1ec7a0fcd..138e71786e03 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -204,6 +204,7 @@ static void __init xen_hvm_guest_init(void)
 	if (xen_feature(XENFEAT_hvm_callback_vector))
 		xen_have_vector_callback = 1;
 
+	xen_setup_syscore_ops();
 	xen_hvm_smp_init();
 	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
 	xen_unplug_emulated_devices();
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 1d83152c761b..784c4484100b 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -2,17 +2,22 @@
 #include <linux/types.h>
 #include <linux/tick.h>
 #include <linux/percpu-defs.h>
+#include <linux/syscore_ops.h>
+#include <linux/kernel_stat.h>
 
 #include <xen/xen.h>
 #include <xen/interface/xen.h>
+#include <xen/interface/memory.h>
 #include <xen/grant_table.h>
 #include <xen/events.h>
+#include <xen/xen-ops.h>
 
 #include <asm/cpufeatures.h>
 #include <asm/msr-index.h>
 #include <asm/xen/hypercall.h>
 #include <asm/xen/page.h>
 #include <asm/fixmap.h>
+#include <asm/pvclock.h>
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -82,3 +87,51 @@ void xen_arch_suspend(void)
 
 	on_each_cpu(xen_vcpu_notify_suspend, NULL, 1);
 }
+
+static int xen_syscore_suspend(void)
+{
+	struct xen_remove_from_physmap xrfp;
+	int ret;
+
+	/* Xen suspend does similar stuffs in its own logic */
+	if (xen_suspend_mode_is_xen_suspend())
+		return 0;
+
+	xrfp.domid = DOMID_SELF;
+	xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+
+	ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
+	if (!ret)
+		HYPERVISOR_shared_info = &xen_dummy_shared_info;
+
+	return ret;
+}
+
+static void xen_syscore_resume(void)
+{
+	/* Xen suspend does similar stuffs in its own logic */
+	if (xen_suspend_mode_is_xen_suspend())
+		return;
+
+	/* No need to setup vcpu_info as it's already moved off */
+	xen_hvm_map_shared_info();
+
+	pvclock_resume();
+
+	gnttab_resume();
+}
+
+/*
+ * These callbacks will be called with interrupts disabled and when having only
+ * one CPU online.
+ */
+static struct syscore_ops xen_hvm_syscore_ops = {
+	.suspend = xen_syscore_suspend,
+	.resume = xen_syscore_resume
+};
+
+void __init xen_setup_syscore_ops(void)
+{
+	if (xen_hvm_domain())
+		register_syscore_ops(&xen_hvm_syscore_ops);
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 4ffe031adfc7..89b1e88712d6 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -43,6 +43,9 @@ int xen_setup_shutdown_event(void);
 bool xen_suspend_mode_is_xen_suspend(void);
 bool xen_suspend_mode_is_pm_suspend(void);
 bool xen_suspend_mode_is_pm_hibernation(void);
+
+void xen_setup_syscore_ops(void);
+
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (3 preceding siblings ...)
  2020-05-19 23:26 ` [PATCH 04/12] x86/xen: add system core suspend and resume callbacks Anchal Agarwal
@ 2020-05-19 23:26 ` Anchal Agarwal
  2020-05-19 23:29   ` Singh, Balbir
                     ` (2 more replies)
  2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:26 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

Many legacy device drivers do not implement power management (PM)
functions which means that interrupts requested by these drivers stay
in active state when the kernel is hibernated.

This does not matter on bare metal and on most hypervisors because the
interrupt is restored on resume without any noticable side effects as
it stays connected to the same physical or virtual interrupt line.

The XEN interrupt mechanism is different as it maintains a mapping
between the Linux interrupt number and a XEN event channel. If the
interrupt stays active on hibernation this mapping is preserved but
there is unfortunately no guarantee that on resume the same event
channels are reassigned to these devices. This can result in event
channel conflicts which prevent the affected devices from being
restored correctly.

One way to solve this would be to add the necessary power management
functions to all affected legacy device drivers, but that's a
questionable effort which does not provide any benefits on non-XEN
environments.

The least intrusive and most efficient solution is to provide a
mechanism which allows the core interrupt code to tear down these
interrupts on hibernation and bring them back up again on resume. This
allows the XEN event channel mechanism to assign an arbitrary event
channel on resume without affecting the functionality of these
devices.

Fortunately all these device interrupts are handled by a dedicated XEN
interrupt chip so the chip can be marked that all interrupts connected
to it are handled this way. This is pretty much in line with the other
interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.

Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
it the core interrupt suspend/resume paths.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off--by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/xen/events/events_base.c |  1 +
 include/linux/irq.h              |  2 ++
 kernel/irq/chip.c                |  2 +-
 kernel/irq/internals.h           |  1 +
 kernel/irq/pm.c                  | 31 ++++++++++++++++++++++---------
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 3a791c8485d0..decf65bd3451 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1613,6 +1613,7 @@ static struct irq_chip xen_pirq_chip __read_mostly = {
 	.irq_set_affinity	= set_affinity_irq,
 
 	.irq_retrigger		= retrigger_dynirq,
+	.flags                  = IRQCHIP_SHUTDOWN_ON_SUSPEND,
 };
 
 static struct irq_chip xen_percpu_chip __read_mostly = {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 8d5bc2c237d7..94cb8c994d06 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -542,6 +542,7 @@ struct irq_chip {
  * IRQCHIP_EOI_THREADED:	Chip requires eoi() on unmask in threaded mode
  * IRQCHIP_SUPPORTS_LEVEL_MSI	Chip can provide two doorbells for Level MSIs
  * IRQCHIP_SUPPORTS_NMI:	Chip can deliver NMIs, only for root irqchips
+ * IRQCHIP_SHUTDOWN_ON_SUSPEND: Shutdown non wake irqs in the suspend path
  */
 enum {
 	IRQCHIP_SET_TYPE_MASKED		= (1 <<  0),
@@ -553,6 +554,7 @@ enum {
 	IRQCHIP_EOI_THREADED		= (1 <<  6),
 	IRQCHIP_SUPPORTS_LEVEL_MSI	= (1 <<  7),
 	IRQCHIP_SUPPORTS_NMI		= (1 <<  8),
+	IRQCHIP_SHUTDOWN_ON_SUSPEND     = (1 <<  9),
 };
 
 #include <linux/irqdesc.h>
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 41e7e37a0928..fd59489ff14b 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -233,7 +233,7 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask *aff, bool force)
 }
 #endif
 
-static int __irq_startup(struct irq_desc *desc)
+int __irq_startup(struct irq_desc *desc)
 {
 	struct irq_data *d = irq_desc_get_irq_data(desc);
 	int ret = 0;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 7db284b10ac9..b6fca5eacff7 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -80,6 +80,7 @@ extern void __enable_irq(struct irq_desc *desc);
 extern int irq_activate(struct irq_desc *desc);
 extern int irq_activate_and_startup(struct irq_desc *desc, bool resend);
 extern int irq_startup(struct irq_desc *desc, bool resend, bool force);
+extern int __irq_startup(struct irq_desc *desc);
 
 extern void irq_shutdown(struct irq_desc *desc);
 extern void irq_shutdown_and_deactivate(struct irq_desc *desc);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index 8f557fa1f4fe..dc48a25f1756 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -85,16 +85,25 @@ static bool suspend_device_irq(struct irq_desc *desc)
 	}
 
 	desc->istate |= IRQS_SUSPENDED;
-	__disable_irq(desc);
-
 	/*
-	 * Hardware which has no wakeup source configuration facility
-	 * requires that the non wakeup interrupts are masked at the
-	 * chip level. The chip implementation indicates that with
-	 * IRQCHIP_MASK_ON_SUSPEND.
+	 * Some irq chips (e.g. XEN PIRQ) require a full shutdown on suspend
+	 * as some of the legacy drivers(e.g. floppy) do nothing during the
+	 * suspend path
 	 */
-	if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
-		mask_irq(desc);
+	if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND) {
+		irq_shutdown(desc);
+	} else {
+		__disable_irq(desc);
+
+	       /*
+		* Hardware which has no wakeup source configuration facility
+		* requires that the non wakeup interrupts are masked at the
+		* chip level. The chip implementation indicates that with
+		* IRQCHIP_MASK_ON_SUSPEND.
+		*/
+		if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
+			mask_irq(desc);
+	}
 	return true;
 }
 
@@ -152,7 +161,11 @@ static void resume_irq(struct irq_desc *desc)
 	irq_state_set_masked(desc);
 resume:
 	desc->istate &= ~IRQS_SUSPENDED;
-	__enable_irq(desc);
+
+	if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND)
+		__irq_startup(desc);
+	else
+		__enable_irq(desc);
 }
 
 static void resume_irqs(bool want_early)
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (4 preceding siblings ...)
  2020-05-19 23:26 ` [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation Anchal Agarwal
@ 2020-05-19 23:27 ` Anchal Agarwal
  2020-05-20  5:00   ` kbuild test robot
                     ` (3 more replies)
  2020-05-19 23:28 ` [PATCH 07/12] xen-netfront: " Anchal Agarwal
                   ` (6 subsequent siblings)
  12 siblings, 4 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:27 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

S4 power transition states are much different than xen
suspend/resume. Former is visible to the guest and frontend drivers should
be aware of the state transitions and should be able to take appropriate
actions when needed. In transition to S4 we need to make sure that at least
all the in-flight blkif requests get completed, since they probably contain
bits of the guest's memory image and that's not going to get saved any
other way. Hence, re-issuing of in-flight requests as in case of xen resume
will not work here. This is in contrast to xen-suspend where we need to
freeze with as little processing as possible to avoid dirtying RAM late in
the migration cycle and we know that in-flight data can wait.

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks. The freeze handler
stops block-layer queue and disconnect the frontend from the backend while
freeing ring_info and associated resources. Before disconnecting from the
backend, we need to prevent any new IO from being queued and wait for existing
IO to complete. Freeze/unfreeze of the queues will guarantee that there are no
requests in use on the shared ring. However, for sanity we should check
state of the ring before disconnecting to make sure that there are no
outstanding requests to be processed on the ring. The restore handler
re-allocates ring_info, unquiesces and unfreezes the queue and re-connect to
the backend, so that rest of the kernel can continue to use the block device
transparently.

Note:For older backends,if a backend doesn't have commit'12ea729645ace'
xen/blkback: unmap all persistent grants when frontend gets disconnected,
the frontend may see massive amount of grant table warning when freeing
resources.
[   36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
[   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!

In this case, persistent grants would need to be disabled.

[Anchal Changelog: Removed timeout/request during blkfront freeze.
Reworked the whole patch to work with blk-mq and incorporate upstream's
comments]

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
---
 drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 7 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3b889ea950c2..464863ed7093 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -48,6 +48,8 @@
 #include <linux/list.h>
 #include <linux/workqueue.h>
 #include <linux/sched/mm.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
 
 #include <xen/xen.h>
 #include <xen/xenbus.h>
@@ -80,6 +82,8 @@ enum blkif_state {
 	BLKIF_STATE_DISCONNECTED,
 	BLKIF_STATE_CONNECTED,
 	BLKIF_STATE_SUSPENDED,
+	BLKIF_STATE_FREEZING,
+	BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -219,6 +223,7 @@ struct blkfront_info
 	struct list_head requests;
 	struct bio_list bio_list;
 	struct list_head info_list;
+	struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
 	info->sector_size = sector_size;
 	info->physical_sector_size = physical_sector_size;
 	blkif_set_queue_limits(info);
+	init_completion(&info->wait_backend_disconnected);
 
 	return 0;
 }
@@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
 		case XEN_SCSI_DISK5_MAJOR:
 		case XEN_SCSI_DISK6_MAJOR:
 		case XEN_SCSI_DISK7_MAJOR:
-			*offset = (*minor / PARTS_PER_DISK) + 
+			*offset = (*minor / PARTS_PER_DISK) +
 				((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
 				EMULATED_SD_DISK_NAME_OFFSET;
 			*minor = *minor +
@@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
 		case XEN_SCSI_DISK13_MAJOR:
 		case XEN_SCSI_DISK14_MAJOR:
 		case XEN_SCSI_DISK15_MAJOR:
-			*offset = (*minor / PARTS_PER_DISK) + 
+			*offset = (*minor / PARTS_PER_DISK) +
 				((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
 				EMULATED_SD_DISK_NAME_OFFSET;
 			*minor = *minor +
@@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 	unsigned int i;
 	struct blkfront_ring_info *rinfo;
 
+	if (info->connected == BLKIF_STATE_FREEZING)
+		goto free_rings;
 	/* Prevent new requests being issued until we fix things up. */
 	info->connected = suspend ?
 		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 	if (info->rq)
 		blk_mq_stop_hw_queues(info->rq);
 
+free_rings:
 	for_each_rinfo(info, rinfo, i)
 		blkif_free_ring(rinfo);
 
@@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
 	struct blkfront_info *info = rinfo->dev_info;
 
-	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
-		return IRQ_HANDLED;
+	if (unlikely(info->connected != BLKIF_STATE_CONNECTED
+		    && info->connected != BLKIF_STATE_FREEZING)){
+	    return IRQ_HANDLED;
+	}
 
 	spin_lock_irqsave(&rinfo->ring_lock, flags);
  again:
@@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
 	unsigned int segs;
 	struct blkfront_ring_info *rinfo;
 
+	bool frozen = info->connected == BLKIF_STATE_FROZEN;
 	blkfront_gather_backend_features(info);
 	/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
 	blkif_set_queue_limits(info);
@@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
 		kick_pending_request_queues(rinfo);
 	}
 
+	if (frozen)
+		return 0;
+
 	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
 		/* Requeue pending requests (flush or discard) */
 		list_del_init(&req->queuelist);
@@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
 
 		return;
 	case BLKIF_STATE_SUSPENDED:
+	case BLKIF_STATE_FROZEN:
 		/*
 		 * If we are recovering from suspension, we need to wait
 		 * for the backend to announce it's features before
@@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
 		break;
 
 	case XenbusStateClosed:
-		if (dev->state == XenbusStateClosed)
+		if (dev->state == XenbusStateClosed) {
+			if (info->connected == BLKIF_STATE_FREEZING) {
+				blkif_free(info, 0);
+				info->connected = BLKIF_STATE_FROZEN;
+				complete(&info->wait_backend_disconnected);
+				break;
+			}
+
 			break;
+		}
+
+		/*
+		 * We may somehow receive backend's Closed again while thawing
+		 * or restoring and it causes thawing or restoring to fail.
+		 * Ignore such unexpected state regardless of the backend state.
+		 */
+		if (info->connected == BLKIF_STATE_FROZEN) {
+			dev_dbg(&dev->dev,
+					"ignore the backend's Closed state: %s",
+					dev->nodename);
+			break;
+		}
 		/* fall through */
 	case XenbusStateClosing:
-		if (info)
-			blkfront_closing(info);
+		if (info) {
+			if (info->connected == BLKIF_STATE_FREEZING)
+				xenbus_frontend_closed(dev);
+			else
+				blkfront_closing(info);
+		}
 		break;
 	}
 }
@@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
 	mutex_unlock(&blkfront_mutex);
 }
 
+static int blkfront_freeze(struct xenbus_device *dev)
+{
+	unsigned int i;
+	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+	struct blkfront_ring_info *rinfo;
+	/* This would be reasonable timeout as used in xenbus_dev_shutdown() */
+	unsigned int timeout = 5 * HZ;
+	unsigned long flags;
+	int err = 0;
+
+	info->connected = BLKIF_STATE_FREEZING;
+
+	blk_mq_freeze_queue(info->rq);
+	blk_mq_quiesce_queue(info->rq);
+
+	for_each_rinfo(info, rinfo, i) {
+	    /* No more gnttab callback work. */
+	    gnttab_cancel_free_callback(&rinfo->callback);
+	    /* Flush gnttab callback work. Must be done with no locks held. */
+	    flush_work(&rinfo->work);
+	}
+
+	for_each_rinfo(info, rinfo, i) {
+	    spin_lock_irqsave(&rinfo->ring_lock, flags);
+	    if (RING_FULL(&rinfo->ring)
+		    || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
+		xenbus_dev_error(dev, err, "Hibernation Failed.
+			The ring is still busy");
+		info->connected = BLKIF_STATE_CONNECTED;
+		spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+		return -EBUSY;
+	}
+	    spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+	}
+	/* Kick the backend to disconnect */
+	xenbus_switch_state(dev, XenbusStateClosing);
+
+	/*
+	 * We don't want to move forward before the frontend is diconnected
+	 * from the backend cleanly.
+	 */
+	timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
+					      timeout);
+	if (!timeout) {
+		err = -EBUSY;
+		xenbus_dev_error(dev, err, "Freezing timed out;"
+				 "the device may become inconsistent state");
+	}
+
+	return err;
+}
+
+static int blkfront_restore(struct xenbus_device *dev)
+{
+	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+	int err = 0;
+
+	err = talk_to_blkback(dev, info);
+	blk_mq_unquiesce_queue(info->rq);
+	blk_mq_unfreeze_queue(info->rq);
+	if (!err)
+	    blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
+	return err;
+}
+
 static const struct block_device_operations xlvbd_block_fops =
 {
 	.owner = THIS_MODULE,
@@ -2653,6 +2758,9 @@ static struct xenbus_driver blkfront_driver = {
 	.resume = blkfront_resume,
 	.otherend_changed = blkback_changed,
 	.is_ready = blkfront_is_ready,
+	.freeze = blkfront_freeze,
+	.thaw = blkfront_restore,
+	.restore = blkfront_restore
 };
 
 static void purge_persistent_grants(struct blkfront_info *info)
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 07/12] xen-netfront: add callbacks for PM suspend and hibernation
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (5 preceding siblings ...)
  2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
@ 2020-05-19 23:28 ` Anchal Agarwal
  2020-05-19 23:28 ` [PATCH 08/12] xen/time: introduce xen_{save,restore}_steal_clock Anchal Agarwal
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:28 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frotnend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and let the xenbus handler to re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but it
can rarely take more than 5 seconds and hit the hard coded timeout,
it would depend on backend state which may be congested and/or have
complex configuration. While it's rare case, longer default timeout
seems a bit more reasonable here to avoid hitting the timeout.
Also, make it configurable via module parameter so that we can cover
broader setups than what we know currently.

[Anchal changelog: Variable name fix and checkpatch.pl fixes]
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
---
 drivers/net/xen-netfront.c | 98 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..65edcdd6e05f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include <linux/moduleparam.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/completion.h>
 #include <net/ip.h>
 
 #include <xen/xen.h>
@@ -56,6 +57,12 @@
 #include <xen/interface/memory.h>
 #include <xen/interface/grant_table.h>
 
+enum netif_freeze_state {
+	NETIF_FREEZE_STATE_UNFROZEN,
+	NETIF_FREEZE_STATE_FREEZING,
+	NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 0644);
 MODULE_PARM_DESC(max_queues,
 		 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+		   netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+		 "timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
 	struct netfront_stats __percpu *tx_stats;
 
 	atomic_t rx_gso_checksum_fixup;
+
+	int freeze_state;
+
+	struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
 	return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+	struct netfront_info *np = netdev_priv(dev);
+	unsigned int num_queues = dev->real_num_tx_queues;
+	unsigned int queue_index;
+	struct netfront_queue *queue;
+
+	for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+		queue = &np->queues[queue_index];
+		disable_irq(queue->tx_irq);
+		disable_irq(queue->rx_irq);
+	}
+	return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff *skb,
 				grant_ref_t ref)
 {
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
 
 	np->queues = NULL;
 
+	init_completion(&np->wait_backend_disconnected);
+
 	err = -ENOMEM;
 	np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
 	if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info *info,
 	return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+	struct netfront_info *info = dev_get_drvdata(&dev->dev);
+	unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+	int err = 0;
+
+	xennet_disable_interrupts(info->netdev);
+
+	netif_device_detach(info->netdev);
+
+	info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+	/* Kick the backend to disconnect */
+	xenbus_switch_state(dev, XenbusStateClosing);
+
+	/* We don't want to move forward before the frontend is diconnected
+	 * from the backend cleanly.
+	 */
+	timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
+					      timeout);
+	if (!timeout) {
+		err = -EBUSY;
+		xenbus_dev_error(dev, err, "Freezing timed out;"
+				 "the device may become inconsistent state");
+		return err;
+	}
+
+	/* Tear down queues */
+	xennet_disconnect_backend(info);
+	xennet_destroy_queues(info);
+
+	info->freeze_state = NETIF_FREEZE_STATE_FROZEN;
+
+	return err;
+}
+
+static int netfront_restore(struct xenbus_device *dev)
+{
+	/* Kick the backend to re-connect */
+	xenbus_switch_state(dev, XenbusStateInitialising);
+
+	return 0;
+}
+
 /* Common code used when first setting up, and when resuming. */
 static int talk_to_netback(struct xenbus_device *dev,
 			   struct netfront_info *info)
@@ -1999,6 +2077,8 @@ static int xennet_connect(struct net_device *dev)
 		spin_unlock_bh(&queue->rx_lock);
 	}
 
+	np->freeze_state = NETIF_FREEZE_STATE_UNFROZEN;
+
 	return 0;
 }
 
@@ -2036,10 +2116,23 @@ static void netback_changed(struct xenbus_device *dev,
 		break;
 
 	case XenbusStateClosed:
-		if (dev->state == XenbusStateClosed)
+		if (dev->state == XenbusStateClosed) {
+		     /* dpm context is waiting for the backend */
+			if (np->freeze_state == NETIF_FREEZE_STATE_FREEZING)
+				complete(&np->wait_backend_disconnected);
 			break;
+		}
+
 		/* Fall through - Missed the backend's CLOSING state. */
 	case XenbusStateClosing:
+	       /* We may see unexpected Closed or Closing from the backend.
+		* Just ignore it not to prevent the frontend from being
+		* re-connected in the case of PM suspend or hibernation.
+		*/
+		if (np->freeze_state == NETIF_FREEZE_STATE_FROZEN &&
+		    dev->state == XenbusStateInitialising) {
+			break;
+		}
 		xenbus_frontend_closed(dev);
 		break;
 	}
@@ -2186,6 +2279,9 @@ static struct xenbus_driver netfront_driver = {
 	.probe = netfront_probe,
 	.remove = xennet_remove,
 	.resume = netfront_resume,
+	.freeze = netfront_freeze,
+	.thaw	= netfront_restore,
+	.restore = netfront_restore,
 	.otherend_changed = netback_changed,
 };
 
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 08/12] xen/time: introduce xen_{save,restore}_steal_clock
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (6 preceding siblings ...)
  2020-05-19 23:28 ` [PATCH 07/12] xen-netfront: " Anchal Agarwal
@ 2020-05-19 23:28 ` Anchal Agarwal
  2020-05-30 23:32   ` Boris Ostrovsky
  2020-05-19 23:28 ` [PATCH 09/12] x86/xen: save and restore steal clock Anchal Agarwal
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:28 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

Currently, steal time accounting code in scheduler expects steal clock
callback to provide monotonically increasing value. If the accounting
code receives a smaller value than previous one, it uses a negative
value to calculate steal time and results in incorrectly updated idle
and steal time accounting. This breaks userspace tools which read
/proc/stat.

top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 0.0%si,-1253874204672.0%st

This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen perspective and the time information in runstate info starts over
from scratch.

This patch introduces xen_save_steal_clock() which saves current values
in runstate info into per-cpu variables. Its couterpart,
xen_restore_steal_clock(), sets offset if it found the current values in
runstate info are smaller than previous ones. xen_steal_clock() is also
modified to use the offset to ensure that scheduler only sees
monotonically increasing number.

Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
---
 drivers/xen/time.c    | 29 ++++++++++++++++++++++++++++-
 include/xen/xen-ops.h |  2 ++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 0968859c29d0..3560222cc0dd 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -23,6 +23,9 @@ static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
 static DEFINE_PER_CPU(u64[4], old_runstate_time);
 
+static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
+static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -149,7 +152,7 @@ bool xen_vcpu_stolen(int vcpu)
 	return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
-u64 xen_steal_clock(int cpu)
+static u64 __xen_steal_clock(int cpu)
 {
 	struct vcpu_runstate_info state;
 
@@ -157,6 +160,30 @@ u64 xen_steal_clock(int cpu)
 	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
 }
 
+u64 xen_steal_clock(int cpu)
+{
+	return __xen_steal_clock(cpu) + per_cpu(xen_steal_clock_offset, cpu);
+}
+
+void xen_save_steal_clock(int cpu)
+{
+	per_cpu(xen_prev_steal_clock, cpu) = xen_steal_clock(cpu);
+}
+
+void xen_restore_steal_clock(int cpu)
+{
+	u64 steal_clock = __xen_steal_clock(cpu);
+
+	if (per_cpu(xen_prev_steal_clock, cpu) > steal_clock) {
+		/* Need to update the offset */
+		per_cpu(xen_steal_clock_offset, cpu) =
+		    per_cpu(xen_prev_steal_clock, cpu) - steal_clock;
+	} else {
+		/* Avoid unnecessary steal clock warp */
+		per_cpu(xen_steal_clock_offset, cpu) = 0;
+	}
+}
+
 void xen_setup_runstate_info(int cpu)
 {
 	struct vcpu_register_runstate_memory_area area;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 89b1e88712d6..74fb5eb3aad8 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -37,6 +37,8 @@ void xen_time_setup_guest(void);
 void xen_manage_runstate_time(int action);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 u64 xen_steal_clock(int cpu);
+void xen_save_steal_clock(int cpu);
+void xen_restore_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 09/12] x86/xen: save and restore steal clock
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (7 preceding siblings ...)
  2020-05-19 23:28 ` [PATCH 08/12] xen/time: introduce xen_{save,restore}_steal_clock Anchal Agarwal
@ 2020-05-19 23:28 ` Anchal Agarwal
  2020-05-30 23:44   ` Boris Ostrovsky
  2020-05-19 23:29 ` [PATCH 10/12] xen: Introduce wrapper for save/restore sched clock offset Anchal Agarwal
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:28 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

Save steal clock values of all present CPUs in the system core ops
suspend callbacks. Also, restore a boot CPU's steal clock in the system
core resume callback. For non-boot CPUs, restore after they're brought
up, because runstate info for non-boot CPUs are not active until then.

Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
---
 arch/x86/xen/suspend.c | 13 ++++++++++++-
 arch/x86/xen/time.c    |  3 +++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 784c4484100b..dae0f74f5390 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -91,12 +91,20 @@ void xen_arch_suspend(void)
 static int xen_syscore_suspend(void)
 {
 	struct xen_remove_from_physmap xrfp;
-	int ret;
+	int cpu, ret;
 
 	/* Xen suspend does similar stuffs in its own logic */
 	if (xen_suspend_mode_is_xen_suspend())
 		return 0;
 
+	for_each_present_cpu(cpu) {
+		/*
+		 * Nonboot CPUs are already offline, but the last copy of
+		 * runstate info is still accessible.
+		 */
+		xen_save_steal_clock(cpu);
+	}
+
 	xrfp.domid = DOMID_SELF;
 	xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
 
 	pvclock_resume();
 
+	/* Nonboot CPUs will be resumed when they're brought up */
+	xen_restore_steal_clock(smp_processor_id());
+
 	gnttab_resume();
 }
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index c8897aad13cd..33d754564b09 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -545,6 +545,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
 {
 	int cpu = smp_processor_id();
 	xen_setup_runstate_info(cpu);
+	if (cpu)
+		xen_restore_steal_clock(cpu);
+
 	/*
 	 * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
 	 * doing it xen_hvm_cpu_notify (which gets called by smp_init during
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 10/12] xen: Introduce wrapper for save/restore sched clock offset
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (8 preceding siblings ...)
  2020-05-19 23:28 ` [PATCH 09/12] x86/xen: save and restore steal clock Anchal Agarwal
@ 2020-05-19 23:29 ` Anchal Agarwal
  2020-05-19 23:29 ` [PATCH 11/12] xen: Update sched clock offset to avoid system instability in hibernation Anchal Agarwal
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:29 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

Introduce wrappers for save/restore xen_sched_clock_offset to be
used by PM hibernation code to avoid system instability during resume.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
---
 arch/x86/xen/time.c    | 15 +++++++++++++--
 arch/x86/xen/xen-ops.h |  2 ++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 33d754564b09..1fc2beb7a6c1 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -386,12 +386,23 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
 static u64 xen_clock_value_saved;
 
+/*This is needed to maintain a monotonic clock value during PM hibernation */
+void xen_save_sched_clock_offset(void)
+{
+	xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+}
+
+void xen_restore_sched_clock_offset(void)
+{
+	xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+}
+
 void xen_save_time_memory_area(void)
 {
 	struct vcpu_register_time_memory_area t;
 	int ret;
 
-	xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+	xen_save_sched_clock_offset();
 
 	if (!xen_clock)
 		return;
@@ -434,7 +445,7 @@ void xen_restore_time_memory_area(void)
 out:
 	/* Need pvclock_resume() before using xen_clocksource_read(). */
 	pvclock_resume();
-	xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+	xen_restore_sched_clock_offset();
 }
 
 static void xen_setup_vsyscall_time_info(void)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index d84c357994bd..9f49124df033 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -72,6 +72,8 @@ void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);
+void xen_save_sched_clock_offset(void);
+void xen_restore_sched_clock_offset(void);
 
 irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
 
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation
  2020-05-19 23:26 ` [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation Anchal Agarwal
@ 2020-05-19 23:29   ` Singh, Balbir
  2020-05-19 23:36     ` Agarwal, Anchal
  2020-05-19 23:34   ` Anchal Agarwal
  2020-05-30 23:17   ` Boris Ostrovsky
  2 siblings, 1 reply; 38+ messages in thread
From: Singh, Balbir @ 2020-05-19 23:29 UTC (permalink / raw)
  To: boris.ostrovsky, linux-kernel, Agarwal, Anchal, peterz,
	Woodhouse, David, vkuznets, sstabellini, tglx, linux-pm,
	Valentin, Eduardo, linux-mm, jgross, konrad.wilk, axboe, x86,
	roger.pau, hpa, rjw, mingo, Kamata, Munehisa, pavel, bp, netdev,
	len.brown, davem, benh, xen-devel

On Tue, 2020-05-19 at 23:26 +0000, Anchal Agarwal wrote:
> Signed-off--by: Thomas Gleixner <tglx@linutronix.de>

The Signed-off-by line needs to be fixed (hint: you have --)

Balbir Singh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 11/12] xen: Update sched clock offset to avoid system instability in hibernation
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (9 preceding siblings ...)
  2020-05-19 23:29 ` [PATCH 10/12] xen: Introduce wrapper for save/restore sched clock offset Anchal Agarwal
@ 2020-05-19 23:29 ` Anchal Agarwal
  2020-05-19 23:29 ` [PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA Anchal Agarwal
  2020-05-28 17:59 ` [PATCH 00/12] Fix PM hibernation in Xen guests Agarwal, Anchal
  12 siblings, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:29 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

Save/restore xen_sched_clock_offset in syscore suspend/resume during PM
hibernation. Commit '867cefb4cb1012: ("xen: Fix x86 sched_clock() interface
for xen")' fixes xen guest time handling during migration. A similar issue
is seen during PM hibernation when system runs CPU intensive workload.
Post resume pvclock resets the value to 0 however, xen sched_clock_offset
is never updated. System instability is seen during resume from hibernation
when system is under heavy CPU load. Since xen_sched_clock_offset is not
updated, system does not see the monotonic clock value and the scheduler
would then think that heavy CPU hog tasks need more time in CPU, causing
the system to freeze

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
---
 arch/x86/xen/suspend.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index dae0f74f5390..7e5275944810 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -105,6 +105,8 @@ static int xen_syscore_suspend(void)
 		xen_save_steal_clock(cpu);
 	}
 
+	xen_save_sched_clock_offset();
+
 	xrfp.domid = DOMID_SELF;
 	xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -126,6 +128,12 @@ static void xen_syscore_resume(void)
 
 	pvclock_resume();
 
+	/*
+	 * Restore xen_sched_clock_offset during resume to maintain
+	 * monotonic clock value
+	 */
+	xen_restore_sched_clock_offset();
+
 	/* Nonboot CPUs will be resumed when they're brought up */
 	xen_restore_steal_clock(smp_processor_id());
 
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (10 preceding siblings ...)
  2020-05-19 23:29 ` [PATCH 11/12] xen: Update sched clock offset to avoid system instability in hibernation Anchal Agarwal
@ 2020-05-19 23:29 ` Anchal Agarwal
  2020-05-28 17:59 ` [PATCH 00/12] Fix PM hibernation in Xen guests Agarwal, Anchal
  12 siblings, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:29 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Aleksei Besogonov <cyberax@amazon.com>

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().

Before this patch, the command line processing was the only place where
swsusp_resume_block was set.
[Changelog: Resolved patch conflict as code fragmented to
snapshot_set_swap_area]
Signed-off-by: Aleksei Besogonov <cyberax@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
---
 kernel/power/user.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index 7959449765d9..1afa1f0a223e 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -235,8 +235,12 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
 		return -EINVAL;
 	}
 	data->swap = swap_type_of(swdev, offset, NULL);
-	if (data->swap < 0)
+	if (data->swap < 0) {
 		return -ENODEV;
+	} else {
+	    swsusp_resume_device = swdev;
+	    swsusp_resume_block = offset;
+	}
 	return 0;
 }
 
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation
  2020-05-19 23:26 ` [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation Anchal Agarwal
  2020-05-19 23:29   ` Singh, Balbir
@ 2020-05-19 23:34   ` Anchal Agarwal
  2020-05-30 23:17   ` Boris Ostrovsky
  2 siblings, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-19 23:34 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

Resending with fixed Signed-off-by.

Many legacy device drivers do not implement power management (PM)
functions which means that interrupts requested by these drivers stay
in active state when the kernel is hibernated.

This does not matter on bare metal and on most hypervisors because the
interrupt is restored on resume without any noticable side effects as
it stays connected to the same physical or virtual interrupt line.

The XEN interrupt mechanism is different as it maintains a mapping
between the Linux interrupt number and a XEN event channel. If the
interrupt stays active on hibernation this mapping is preserved but
there is unfortunately no guarantee that on resume the same event
channels are reassigned to these devices. This can result in event
channel conflicts which prevent the affected devices from being
restored correctly.

One way to solve this would be to add the necessary power management
functions to all affected legacy device drivers, but that's a
questionable effort which does not provide any benefits on non-XEN
environments.

The least intrusive and most efficient solution is to provide a
mechanism which allows the core interrupt code to tear down these
interrupts on hibernation and bring them back up again on resume. This
allows the XEN event channel mechanism to assign an arbitrary event
channel on resume without affecting the functionality of these
devices.

Fortunately all these device interrupts are handled by a dedicated XEN
interrupt chip so the chip can be marked that all interrupts connected
to it are handled this way. This is pretty much in line with the other
interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.

Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
it the core interrupt suspend/resume paths.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/xen/events/events_base.c |  1 +
 include/linux/irq.h              |  2 ++
 kernel/irq/chip.c                |  2 +-
 kernel/irq/internals.h           |  1 +
 kernel/irq/pm.c                  | 31 ++++++++++++++++++++++---------
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 3a791c8485d0..decf65bd3451 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1613,6 +1613,7 @@ static struct irq_chip xen_pirq_chip __read_mostly = {
 	.irq_set_affinity	= set_affinity_irq,
 
 	.irq_retrigger		= retrigger_dynirq,
+	.flags                  = IRQCHIP_SHUTDOWN_ON_SUSPEND,
 };
 
 static struct irq_chip xen_percpu_chip __read_mostly = {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 8d5bc2c237d7..94cb8c994d06 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -542,6 +542,7 @@ struct irq_chip {
  * IRQCHIP_EOI_THREADED:	Chip requires eoi() on unmask in threaded mode
  * IRQCHIP_SUPPORTS_LEVEL_MSI	Chip can provide two doorbells for Level MSIs
  * IRQCHIP_SUPPORTS_NMI:	Chip can deliver NMIs, only for root irqchips
+ * IRQCHIP_SHUTDOWN_ON_SUSPEND: Shutdown non wake irqs in the suspend path
  */
 enum {
 	IRQCHIP_SET_TYPE_MASKED		= (1 <<  0),
@@ -553,6 +554,7 @@ enum {
 	IRQCHIP_EOI_THREADED		= (1 <<  6),
 	IRQCHIP_SUPPORTS_LEVEL_MSI	= (1 <<  7),
 	IRQCHIP_SUPPORTS_NMI		= (1 <<  8),
+	IRQCHIP_SHUTDOWN_ON_SUSPEND     = (1 <<  9),
 };
 
 #include <linux/irqdesc.h>
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 41e7e37a0928..fd59489ff14b 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -233,7 +233,7 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask *aff, bool force)
 }
 #endif
 
-static int __irq_startup(struct irq_desc *desc)
+int __irq_startup(struct irq_desc *desc)
 {
 	struct irq_data *d = irq_desc_get_irq_data(desc);
 	int ret = 0;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 7db284b10ac9..b6fca5eacff7 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -80,6 +80,7 @@ extern void __enable_irq(struct irq_desc *desc);
 extern int irq_activate(struct irq_desc *desc);
 extern int irq_activate_and_startup(struct irq_desc *desc, bool resend);
 extern int irq_startup(struct irq_desc *desc, bool resend, bool force);
+extern int __irq_startup(struct irq_desc *desc);
 
 extern void irq_shutdown(struct irq_desc *desc);
 extern void irq_shutdown_and_deactivate(struct irq_desc *desc);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index 8f557fa1f4fe..dc48a25f1756 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -85,16 +85,25 @@ static bool suspend_device_irq(struct irq_desc *desc)
 	}
 
 	desc->istate |= IRQS_SUSPENDED;
-	__disable_irq(desc);
-
 	/*
-	 * Hardware which has no wakeup source configuration facility
-	 * requires that the non wakeup interrupts are masked at the
-	 * chip level. The chip implementation indicates that with
-	 * IRQCHIP_MASK_ON_SUSPEND.
+	 * Some irq chips (e.g. XEN PIRQ) require a full shutdown on suspend
+	 * as some of the legacy drivers(e.g. floppy) do nothing during the
+	 * suspend path
 	 */
-	if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
-		mask_irq(desc);
+	if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND) {
+		irq_shutdown(desc);
+	} else {
+		__disable_irq(desc);
+
+	       /*
+		* Hardware which has no wakeup source configuration facility
+		* requires that the non wakeup interrupts are masked at the
+		* chip level. The chip implementation indicates that with
+		* IRQCHIP_MASK_ON_SUSPEND.
+		*/
+		if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
+			mask_irq(desc);
+	}
 	return true;
 }
 
@@ -152,7 +161,11 @@ static void resume_irq(struct irq_desc *desc)
 	irq_state_set_masked(desc);
 resume:
 	desc->istate &= ~IRQS_SUSPENDED;
-	__enable_irq(desc);
+
+	if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND)
+		__irq_startup(desc);
+	else
+		__enable_irq(desc);
 }
 
 static void resume_irqs(bool want_early)
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation
  2020-05-19 23:29   ` Singh, Balbir
@ 2020-05-19 23:36     ` Agarwal, Anchal
  0 siblings, 0 replies; 38+ messages in thread
From: Agarwal, Anchal @ 2020-05-19 23:36 UTC (permalink / raw)
  To: Singh, Balbir, boris.ostrovsky, linux-kernel, peterz, Woodhouse,
	David, vkuznets, sstabellini, tglx, linux-pm, Valentin, Eduardo,
	linux-mm, jgross, konrad.wilk, axboe, x86, roger.pau, hpa, rjw,
	mingo, Kamata, Munehisa, pavel, bp, netdev, len.brown, davem,
	benh, xen-devel

Thanks. Looks like send an old one without fix. Did resend the patch again.

    On Tue, 2020-05-19 at 23:26 +0000, Anchal Agarwal wrote:
    > Signed-off--by: Thomas Gleixner <tglx@linutronix.de>

    The Signed-off-by line needs to be fixed (hint: you have --)

    Balbir Singh



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
  2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
@ 2020-05-20  5:00   ` kbuild test robot
  2020-05-20  5:07   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2020-05-20  5:00 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, boris.ostrovsky,
	jgross, linux-pm, linux-mm, kamatam, sstabellini, konrad.wilk,
	roger.pau, axboe, davem, rjw, len.brown, pavel, peterz, eduval,
	sblbir, xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh
  Cc: kbuild-all, clang-built-linux


[-- Attachment #1: Type: text/plain, Size: 10029 bytes --]

Hi Anchal,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.7-rc6]
[cannot apply to xen-tip/linux-next tip/irq/core tip/auto-latest next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Anchal-Agarwal/Fix-PM-hibernation-in-Xen-guests/20200520-073211
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 03fb3acae4be8a6b680ffedb220a8b6c07260b40
config: x86_64-randconfig-a016-20200519 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project e6658079aca6d971b4e9d7137a3a2ecbc9c34aec)
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>, old ones prefixed by <<):

>> drivers/block/xen-blkfront.c:2699:30: warning: missing terminating '"' character [-Winvalid-pp-token]
xenbus_dev_error(dev, err, "Hibernation Failed.
^
>> drivers/block/xen-blkfront.c:2699:30: error: expected expression
drivers/block/xen-blkfront.c:2700:26: warning: missing terminating '"' character [-Winvalid-pp-token]
The ring is still busy");
^
>> drivers/block/xen-blkfront.c:2726:1: error: function definition is not allowed here
{
^
>> drivers/block/xen-blkfront.c:2762:10: error: use of undeclared identifier 'blkfront_restore'
.thaw = blkfront_restore,
^
drivers/block/xen-blkfront.c:2763:13: error: use of undeclared identifier 'blkfront_restore'
.restore = blkfront_restore
^
drivers/block/xen-blkfront.c:2767:1: error: function definition is not allowed here
{
^
drivers/block/xen-blkfront.c:2800:1: error: function definition is not allowed here
{
^
drivers/block/xen-blkfront.c:2822:1: error: function definition is not allowed here
{
^
>> drivers/block/xen-blkfront.c:2863:13: error: use of undeclared identifier 'xlblk_init'
module_init(xlblk_init);
^
drivers/block/xen-blkfront.c:2867:1: error: function definition is not allowed here
{
^
>> drivers/block/xen-blkfront.c:2874:13: error: use of undeclared identifier 'xlblk_exit'
module_exit(xlblk_exit);
^
>> drivers/block/xen-blkfront.c:2880:24: error: expected '}'
MODULE_ALIAS("xenblk");
^
drivers/block/xen-blkfront.c:2674:1: note: to match this '{'
{
^
>> drivers/block/xen-blkfront.c:2738:45: warning: ISO C90 forbids mixing declarations and code [-Wdeclaration-after-statement]
static const struct block_device_operations xlvbd_block_fops =
^
3 warnings and 11 errors generated.

vim +2699 drivers/block/xen-blkfront.c

  2672	
  2673	static int blkfront_freeze(struct xenbus_device *dev)
  2674	{
  2675		unsigned int i;
  2676		struct blkfront_info *info = dev_get_drvdata(&dev->dev);
  2677		struct blkfront_ring_info *rinfo;
  2678		/* This would be reasonable timeout as used in xenbus_dev_shutdown() */
  2679		unsigned int timeout = 5 * HZ;
  2680		unsigned long flags;
  2681		int err = 0;
  2682	
  2683		info->connected = BLKIF_STATE_FREEZING;
  2684	
  2685		blk_mq_freeze_queue(info->rq);
  2686		blk_mq_quiesce_queue(info->rq);
  2687	
  2688		for_each_rinfo(info, rinfo, i) {
  2689		    /* No more gnttab callback work. */
  2690		    gnttab_cancel_free_callback(&rinfo->callback);
  2691		    /* Flush gnttab callback work. Must be done with no locks held. */
  2692		    flush_work(&rinfo->work);
  2693		}
  2694	
  2695		for_each_rinfo(info, rinfo, i) {
  2696		    spin_lock_irqsave(&rinfo->ring_lock, flags);
  2697		    if (RING_FULL(&rinfo->ring)
  2698			    || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
> 2699			xenbus_dev_error(dev, err, "Hibernation Failed.
  2700				The ring is still busy");
  2701			info->connected = BLKIF_STATE_CONNECTED;
  2702			spin_unlock_irqrestore(&rinfo->ring_lock, flags);
  2703			return -EBUSY;
  2704		}
  2705		    spin_unlock_irqrestore(&rinfo->ring_lock, flags);
  2706		}
  2707		/* Kick the backend to disconnect */
  2708		xenbus_switch_state(dev, XenbusStateClosing);
  2709	
  2710		/*
  2711		 * We don't want to move forward before the frontend is diconnected
  2712		 * from the backend cleanly.
  2713		 */
  2714		timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
  2715						      timeout);
  2716		if (!timeout) {
  2717			err = -EBUSY;
  2718			xenbus_dev_error(dev, err, "Freezing timed out;"
  2719					 "the device may become inconsistent state");
  2720		}
  2721	
  2722		return err;
  2723	}
  2724	
  2725	static int blkfront_restore(struct xenbus_device *dev)
> 2726	{
  2727		struct blkfront_info *info = dev_get_drvdata(&dev->dev);
  2728		int err = 0;
  2729	
  2730		err = talk_to_blkback(dev, info);
  2731		blk_mq_unquiesce_queue(info->rq);
  2732		blk_mq_unfreeze_queue(info->rq);
  2733		if (!err)
  2734		    blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
  2735		return err;
  2736	}
  2737	
> 2738	static const struct block_device_operations xlvbd_block_fops =
  2739	{
  2740		.owner = THIS_MODULE,
  2741		.open = blkif_open,
  2742		.release = blkif_release,
  2743		.getgeo = blkif_getgeo,
  2744		.ioctl = blkif_ioctl,
  2745		.compat_ioctl = blkdev_compat_ptr_ioctl,
  2746	};
  2747	
  2748	
  2749	static const struct xenbus_device_id blkfront_ids[] = {
  2750		{ "vbd" },
  2751		{ "" }
  2752	};
  2753	
  2754	static struct xenbus_driver blkfront_driver = {
  2755		.ids  = blkfront_ids,
  2756		.probe = blkfront_probe,
  2757		.remove = blkfront_remove,
  2758		.resume = blkfront_resume,
  2759		.otherend_changed = blkback_changed,
  2760		.is_ready = blkfront_is_ready,
  2761		.freeze = blkfront_freeze,
> 2762		.thaw = blkfront_restore,
  2763		.restore = blkfront_restore
  2764	};
  2765	
  2766	static void purge_persistent_grants(struct blkfront_info *info)
> 2767	{
  2768		unsigned int i;
  2769		unsigned long flags;
  2770		struct blkfront_ring_info *rinfo;
  2771	
  2772		for_each_rinfo(info, rinfo, i) {
  2773			struct grant *gnt_list_entry, *tmp;
  2774	
  2775			spin_lock_irqsave(&rinfo->ring_lock, flags);
  2776	
  2777			if (rinfo->persistent_gnts_c == 0) {
  2778				spin_unlock_irqrestore(&rinfo->ring_lock, flags);
  2779				continue;
  2780			}
  2781	
  2782			list_for_each_entry_safe(gnt_list_entry, tmp, &rinfo->grants,
  2783						 node) {
  2784				if (gnt_list_entry->gref == GRANT_INVALID_REF ||
  2785				    gnttab_query_foreign_access(gnt_list_entry->gref))
  2786					continue;
  2787	
  2788				list_del(&gnt_list_entry->node);
  2789				gnttab_end_foreign_access(gnt_list_entry->gref, 0, 0UL);
  2790				rinfo->persistent_gnts_c--;
  2791				gnt_list_entry->gref = GRANT_INVALID_REF;
  2792				list_add_tail(&gnt_list_entry->node, &rinfo->grants);
  2793			}
  2794	
  2795			spin_unlock_irqrestore(&rinfo->ring_lock, flags);
  2796		}
  2797	}
  2798	
  2799	static void blkfront_delay_work(struct work_struct *work)
  2800	{
  2801		struct blkfront_info *info;
  2802		bool need_schedule_work = false;
  2803	
  2804		mutex_lock(&blkfront_mutex);
  2805	
  2806		list_for_each_entry(info, &info_list, info_list) {
  2807			if (info->feature_persistent) {
  2808				need_schedule_work = true;
  2809				mutex_lock(&info->mutex);
  2810				purge_persistent_grants(info);
  2811				mutex_unlock(&info->mutex);
  2812			}
  2813		}
  2814	
  2815		if (need_schedule_work)
  2816			schedule_delayed_work(&blkfront_work, HZ * 10);
  2817	
  2818		mutex_unlock(&blkfront_mutex);
  2819	}
  2820	
  2821	static int __init xlblk_init(void)
> 2822	{
  2823		int ret;
  2824		int nr_cpus = num_online_cpus();
  2825	
  2826		if (!xen_domain())
  2827			return -ENODEV;
  2828	
  2829		if (!xen_has_pv_disk_devices())
  2830			return -ENODEV;
  2831	
  2832		if (register_blkdev(XENVBD_MAJOR, DEV_NAME)) {
  2833			pr_warn("xen_blk: can't get major %d with name %s\n",
  2834				XENVBD_MAJOR, DEV_NAME);
  2835			return -ENODEV;
  2836		}
  2837	
  2838		if (xen_blkif_max_segments < BLKIF_MAX_SEGMENTS_PER_REQUEST)
  2839			xen_blkif_max_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
  2840	
  2841		if (xen_blkif_max_ring_order > XENBUS_MAX_RING_GRANT_ORDER) {
  2842			pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
  2843				xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
  2844			xen_blkif_max_ring_order = XENBUS_MAX_RING_GRANT_ORDER;
  2845		}
  2846	
  2847		if (xen_blkif_max_queues > nr_cpus) {
  2848			pr_info("Invalid max_queues (%d), will use default max: %d.\n",
  2849				xen_blkif_max_queues, nr_cpus);
  2850			xen_blkif_max_queues = nr_cpus;
  2851		}
  2852	
  2853		INIT_DELAYED_WORK(&blkfront_work, blkfront_delay_work);
  2854	
  2855		ret = xenbus_register_frontend(&blkfront_driver);
  2856		if (ret) {
  2857			unregister_blkdev(XENVBD_MAJOR, DEV_NAME);
  2858			return ret;
  2859		}
  2860	
  2861		return 0;
  2862	}
> 2863	module_init(xlblk_init);
  2864	
  2865	
  2866	static void __exit xlblk_exit(void)
  2867	{
  2868		cancel_delayed_work_sync(&blkfront_work);
  2869	
  2870		xenbus_unregister_driver(&blkfront_driver);
  2871		unregister_blkdev(XENVBD_MAJOR, DEV_NAME);
  2872		kfree(minors);
  2873	}
> 2874	module_exit(xlblk_exit);
  2875	
  2876	MODULE_DESCRIPTION("Xen virtual block device frontend");
  2877	MODULE_LICENSE("GPL");
  2878	MODULE_ALIAS_BLOCKDEV_MAJOR(XENVBD_MAJOR);
  2879	MODULE_ALIAS("xen:vbd");
> 2880	MODULE_ALIAS("xenblk");

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40415 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
  2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
  2020-05-20  5:00   ` kbuild test robot
@ 2020-05-20  5:07   ` kbuild test robot
  2020-05-21 23:48   ` Anchal Agarwal
  2020-05-28 12:30   ` Roger Pau Monné
  3 siblings, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2020-05-20  5:07 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, boris.ostrovsky,
	jgross, linux-pm, linux-mm, kamatam, sstabellini, konrad.wilk,
	roger.pau, axboe, davem, rjw, len.brown, pavel, peterz, eduval,
	sblbir, xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh
  Cc: kbuild-all


[-- Attachment #1: Type: text/plain, Size: 4085 bytes --]

Hi Anchal,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.7-rc6]
[cannot apply to xen-tip/linux-next tip/irq/core tip/auto-latest next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Anchal-Agarwal/Fix-PM-hibernation-in-Xen-guests/20200520-073211
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 03fb3acae4be8a6b680ffedb220a8b6c07260b40
config: x86_64-rhel (attached as .config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>, old ones prefixed by <<):

drivers/block/xen-blkfront.c: In function 'blkfront_freeze':
>> drivers/block/xen-blkfront.c:2699:30: warning: missing terminating " character
xenbus_dev_error(dev, err, "Hibernation Failed.
^
>> drivers/block/xen-blkfront.c:2699:30: error: missing terminating " character
xenbus_dev_error(dev, err, "Hibernation Failed.
^~~~~~~~~~~~~~~~~~~~
>> drivers/block/xen-blkfront.c:2700:4: error: 'The' undeclared (first use in this function)
The ring is still busy");
^~~
drivers/block/xen-blkfront.c:2700:4: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/block/xen-blkfront.c:2700:8: error: expected ')' before 'ring'
The ring is still busy");
^~~~
drivers/block/xen-blkfront.c:2700:26: warning: missing terminating " character
The ring is still busy");
^
drivers/block/xen-blkfront.c:2700:26: error: missing terminating " character
The ring is still busy");
^~~
>> drivers/block/xen-blkfront.c:2704:2: error: expected ';' before '}' token
}
^

vim +2699 drivers/block/xen-blkfront.c

  2672	
  2673	static int blkfront_freeze(struct xenbus_device *dev)
  2674	{
  2675		unsigned int i;
  2676		struct blkfront_info *info = dev_get_drvdata(&dev->dev);
  2677		struct blkfront_ring_info *rinfo;
  2678		/* This would be reasonable timeout as used in xenbus_dev_shutdown() */
  2679		unsigned int timeout = 5 * HZ;
  2680		unsigned long flags;
  2681		int err = 0;
  2682	
  2683		info->connected = BLKIF_STATE_FREEZING;
  2684	
  2685		blk_mq_freeze_queue(info->rq);
  2686		blk_mq_quiesce_queue(info->rq);
  2687	
  2688		for_each_rinfo(info, rinfo, i) {
  2689		    /* No more gnttab callback work. */
  2690		    gnttab_cancel_free_callback(&rinfo->callback);
  2691		    /* Flush gnttab callback work. Must be done with no locks held. */
  2692		    flush_work(&rinfo->work);
  2693		}
  2694	
  2695		for_each_rinfo(info, rinfo, i) {
  2696		    spin_lock_irqsave(&rinfo->ring_lock, flags);
  2697		    if (RING_FULL(&rinfo->ring)
  2698			    || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
> 2699			xenbus_dev_error(dev, err, "Hibernation Failed.
> 2700				The ring is still busy");
  2701			info->connected = BLKIF_STATE_CONNECTED;
  2702			spin_unlock_irqrestore(&rinfo->ring_lock, flags);
  2703			return -EBUSY;
> 2704		}
  2705		    spin_unlock_irqrestore(&rinfo->ring_lock, flags);
  2706		}
  2707		/* Kick the backend to disconnect */
  2708		xenbus_switch_state(dev, XenbusStateClosing);
  2709	
  2710		/*
  2711		 * We don't want to move forward before the frontend is diconnected
  2712		 * from the backend cleanly.
  2713		 */
  2714		timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
  2715						      timeout);
  2716		if (!timeout) {
  2717			err = -EBUSY;
  2718			xenbus_dev_error(dev, err, "Freezing timed out;"
  2719					 "the device may become inconsistent state");
  2720		}
  2721	
  2722		return err;
  2723	}
  2724	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 44803 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
  2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
  2020-05-20  5:00   ` kbuild test robot
  2020-05-20  5:07   ` kbuild test robot
@ 2020-05-21 23:48   ` Anchal Agarwal
  2020-05-22  1:43     ` Singh, Balbir
  2020-05-28 12:30   ` Roger Pau Monné
  3 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-05-21 23:48 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, anchalag,
	xen-devel, vkuznets, netdev, linux-kernel, dwmw, benh

From: Munehisa Kamata <kamatam@amazon.com>

S4 power transisiton states are much different than xen
suspend/resume. Former is visible to the guest and frontend drivers should
be aware of the state transistions and should be able to take appropriate
actions when needed. In transition to S4 we need to make sure that at least
all the in-flight blkif requests get completed, since they probably contain
bits of the guest's memory image and that's not going to get saved any
other way. Hence, re-issuing of in-flight requests as in case of xen resume
will not work here. This is in contrast to xen-suspend where we need to
freeze with as little processing as possible to avoid dirtying RAM late in
the migration cycle and we know that in-flight data can wait.

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks. The freeze handler
stops block-layer queue and disconnect the frontend from the backend while
freeing ring_info and associated resources. Before disconnecting from the
backend, we need to prevent any new IO from being queued and wait for
existing IO to complete. Freeze/unfreeze of the queues will guarantee that
there are no requests in use on the shared ring. However, for sanity we
should check state of the ring before disconnecting to make sure that there
are no outstanding requests to be processed on the ring. The restore
handler re-allocates ring_info, unquiesces and unfreezes the queue
and re-connect to the backend, so that rest of the kernel can continue
to use the block device transparently.

Note:For older backends,if a backend doesn't have commit'12ea729645ace'
xen/blkback: unmap all persistent grants when frontend gets disconnected,
the frontend may see massive amount of grant table warning when freeing
resources.
[   36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
[   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!

In this case, persistent grants would need to be disabled.

[Anchal Changelog: Removed timeout/request during blkfront freeze.
Reworked the whole patch to work with blk-mq and incorporate upstream's
comments]

Fixes: Build errors reported by kbuild due to linebreak
Reported-by: kbuild test robot <lkp@intel.com>

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
---
 drivers/block/xen-blkfront.c | 118 +++++++++++++++++++++++++++++++++--
 1 file changed, 112 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3b889ea950c2..34b0e51697b6 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -48,6 +48,8 @@
 #include <linux/list.h>
 #include <linux/workqueue.h>
 #include <linux/sched/mm.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
 
 #include <xen/xen.h>
 #include <xen/xenbus.h>
@@ -80,6 +82,8 @@ enum blkif_state {
 	BLKIF_STATE_DISCONNECTED,
 	BLKIF_STATE_CONNECTED,
 	BLKIF_STATE_SUSPENDED,
+	BLKIF_STATE_FREEZING,
+	BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -219,6 +223,7 @@ struct blkfront_info
 	struct list_head requests;
 	struct bio_list bio_list;
 	struct list_head info_list;
+	struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
 	info->sector_size = sector_size;
 	info->physical_sector_size = physical_sector_size;
 	blkif_set_queue_limits(info);
+	init_completion(&info->wait_backend_disconnected);
 
 	return 0;
 }
@@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
 		case XEN_SCSI_DISK5_MAJOR:
 		case XEN_SCSI_DISK6_MAJOR:
 		case XEN_SCSI_DISK7_MAJOR:
-			*offset = (*minor / PARTS_PER_DISK) + 
+			*offset = (*minor / PARTS_PER_DISK) +
 				((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
 				EMULATED_SD_DISK_NAME_OFFSET;
 			*minor = *minor +
@@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
 		case XEN_SCSI_DISK13_MAJOR:
 		case XEN_SCSI_DISK14_MAJOR:
 		case XEN_SCSI_DISK15_MAJOR:
-			*offset = (*minor / PARTS_PER_DISK) + 
+			*offset = (*minor / PARTS_PER_DISK) +
 				((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
 				EMULATED_SD_DISK_NAME_OFFSET;
 			*minor = *minor +
@@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 	unsigned int i;
 	struct blkfront_ring_info *rinfo;
 
+	if (info->connected == BLKIF_STATE_FREEZING)
+		goto free_rings;
 	/* Prevent new requests being issued until we fix things up. */
 	info->connected = suspend ?
 		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 	if (info->rq)
 		blk_mq_stop_hw_queues(info->rq);
 
+free_rings:
 	for_each_rinfo(info, rinfo, i)
 		blkif_free_ring(rinfo);
 
@@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
 	struct blkfront_info *info = rinfo->dev_info;
 
-	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
+	if (unlikely(info->connected != BLKIF_STATE_CONNECTED
+		&& info->connected != BLKIF_STATE_FREEZING)){
 		return IRQ_HANDLED;
+	}
 
 	spin_lock_irqsave(&rinfo->ring_lock, flags);
  again:
@@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
 	unsigned int segs;
 	struct blkfront_ring_info *rinfo;
 
+	bool frozen = info->connected == BLKIF_STATE_FROZEN;
 	blkfront_gather_backend_features(info);
 	/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
 	blkif_set_queue_limits(info);
@@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
 		kick_pending_request_queues(rinfo);
 	}
 
+	if (frozen)
+		return 0;
+
 	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
 		/* Requeue pending requests (flush or discard) */
 		list_del_init(&req->queuelist);
@@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
 
 		return;
 	case BLKIF_STATE_SUSPENDED:
+	case BLKIF_STATE_FROZEN:
 		/*
 		 * If we are recovering from suspension, we need to wait
 		 * for the backend to announce it's features before
@@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
 		break;
 
 	case XenbusStateClosed:
-		if (dev->state == XenbusStateClosed)
+		if (dev->state == XenbusStateClosed) {
+			if (info->connected == BLKIF_STATE_FREEZING) {
+				blkif_free(info, 0);
+				info->connected = BLKIF_STATE_FROZEN;
+				complete(&info->wait_backend_disconnected);
+				break;
+			}
+
+			break;
+		}
+
+		/*
+		 * We may somehow receive backend's Closed again while thawing
+		 * or restoring and it causes thawing or restoring to fail.
+		 * Ignore such unexpected state regardless of the backend state.
+		 */
+		if (info->connected == BLKIF_STATE_FROZEN) {
+			dev_dbg(&dev->dev,
+					"ignore the backend's Closed state: %s",
+					dev->nodename);
 			break;
+		}
 		/* fall through */
 	case XenbusStateClosing:
-		if (info)
-			blkfront_closing(info);
+		if (info) {
+			if (info->connected == BLKIF_STATE_FREEZING)
+				xenbus_frontend_closed(dev);
+			else
+				blkfront_closing(info);
+		}
 		break;
 	}
 }
@@ -2630,6 +2670,69 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
 	mutex_unlock(&blkfront_mutex);
 }
 
+static int blkfront_freeze(struct xenbus_device *dev)
+{
+	unsigned int i;
+	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+	struct blkfront_ring_info *rinfo;
+	/* This would be reasonable timeout as used in xenbus_dev_shutdown() */
+	unsigned int timeout = 5 * HZ;
+	unsigned long flags;
+	int err = 0;
+
+	info->connected = BLKIF_STATE_FREEZING;
+
+	blk_mq_freeze_queue(info->rq);
+	blk_mq_quiesce_queue(info->rq);
+
+	for_each_rinfo(info, rinfo, i) {
+		/* No more gnttab callback work. */
+		gnttab_cancel_free_callback(&rinfo->callback);
+		/* Flush gnttab callback work. Must be done with no locks held. */
+		flush_work(&rinfo->work);
+	}
+
+	for_each_rinfo(info, rinfo, i) {
+		spin_lock_irqsave(&rinfo->ring_lock, flags);
+		if (RING_FULL(&rinfo->ring)
+			|| RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
+			xenbus_dev_error(dev, err, "Hibernation Failed.The ring is still busy");
+			info->connected = BLKIF_STATE_CONNECTED;
+			spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+			return -EBUSY;
+		}
+		spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+	}
+	/* Kick the backend to disconnect */
+	xenbus_switch_state(dev, XenbusStateClosing);
+
+	/*
+	 * We don't want to move forward before the frontend is diconnected
+	 * from the backend cleanly.
+	 */
+	timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
+					      timeout);
+	if (!timeout) {
+		err = -EBUSY;
+		xenbus_dev_error(dev, err, "Freezing timed out;"
+				 "the device may become inconsistent state");
+	}
+	return err;
+}
+
+static int blkfront_restore(struct xenbus_device *dev)
+{
+	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+	int err = 0;
+
+	err = talk_to_blkback(dev, info);
+	blk_mq_unquiesce_queue(info->rq);
+	blk_mq_unfreeze_queue(info->rq);
+	if (!err)
+		blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
+	return err;
+}
+
 static const struct block_device_operations xlvbd_block_fops =
 {
 	.owner = THIS_MODULE,
@@ -2653,6 +2756,9 @@ static struct xenbus_driver blkfront_driver = {
 	.resume = blkfront_resume,
 	.otherend_changed = blkback_changed,
 	.is_ready = blkfront_is_ready,
+	.freeze = blkfront_freeze,
+	.thaw = blkfront_restore,
+	.restore = blkfront_restore
 };
 
 static void purge_persistent_grants(struct blkfront_info *info)
-- 
2.24.1.AMZN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
  2020-05-21 23:48   ` Anchal Agarwal
@ 2020-05-22  1:43     ` Singh, Balbir
  0 siblings, 0 replies; 38+ messages in thread
From: Singh, Balbir @ 2020-05-22  1:43 UTC (permalink / raw)
  To: boris.ostrovsky, linux-kernel, Agarwal, Anchal, peterz,
	Woodhouse, David, vkuznets, sstabellini, tglx, linux-pm,
	Valentin, Eduardo, linux-mm, jgross, konrad.wilk, axboe, x86,
	roger.pau, hpa, rjw, mingo, Kamata, Munehisa, pavel, bp, netdev,
	len.brown, davem, benh, xen-devel

> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  		case XEN_SCSI_DISK5_MAJOR:
>  		case XEN_SCSI_DISK6_MAJOR:
>  		case XEN_SCSI_DISK7_MAJOR:
> -			*offset = (*minor / PARTS_PER_DISK) + 
> +			*offset = (*minor / PARTS_PER_DISK) +
>  				((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
>  				EMULATED_SD_DISK_NAME_OFFSET;
>  			*minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  		case XEN_SCSI_DISK13_MAJOR:
>  		case XEN_SCSI_DISK14_MAJOR:
>  		case XEN_SCSI_DISK15_MAJOR:
> -			*offset = (*minor / PARTS_PER_DISK) + 
> +			*offset = (*minor / PARTS_PER_DISK) +
>  				((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
>  				EMULATED_SD_DISK_NAME_OFFSET;
>  			*minor = *minor +

These seem like whitespace fixes? If so, they should be in a separate patch

Balbir


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
  2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
                     ` (2 preceding siblings ...)
  2020-05-21 23:48   ` Anchal Agarwal
@ 2020-05-28 12:30   ` Roger Pau Monné
  3 siblings, 0 replies; 38+ messages in thread
From: Roger Pau Monné @ 2020-05-28 12:30 UTC (permalink / raw)
  To: Anchal Agarwal
  Cc: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, axboe, davem, rjw,
	len.brown, pavel, peterz, eduval, sblbir, xen-devel, vkuznets,
	netdev, linux-kernel, dwmw, benh

On Tue, May 19, 2020 at 11:27:50PM +0000, Anchal Agarwal wrote:
> From: Munehisa Kamata <kamatam@amazon.com>
> 
> S4 power transition states are much different than xen
> suspend/resume. Former is visible to the guest and frontend drivers should
> be aware of the state transitions and should be able to take appropriate
> actions when needed. In transition to S4 we need to make sure that at least
> all the in-flight blkif requests get completed, since they probably contain
> bits of the guest's memory image and that's not going to get saved any
> other way. Hence, re-issuing of in-flight requests as in case of xen resume
> will not work here. This is in contrast to xen-suspend where we need to
> freeze with as little processing as possible to avoid dirtying RAM late in
> the migration cycle and we know that in-flight data can wait.
> 
> Add freeze, thaw and restore callbacks for PM suspend and hibernation
> support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
> events, need to implement these xenbus_driver callbacks. The freeze handler
> stops block-layer queue and disconnect the frontend from the backend while
> freeing ring_info and associated resources. Before disconnecting from the
> backend, we need to prevent any new IO from being queued and wait for existing
> IO to complete. Freeze/unfreeze of the queues will guarantee that there are no
> requests in use on the shared ring. However, for sanity we should check
> state of the ring before disconnecting to make sure that there are no
> outstanding requests to be processed on the ring. The restore handler
> re-allocates ring_info, unquiesces and unfreezes the queue and re-connect to
> the backend, so that rest of the kernel can continue to use the block device
> transparently.
> 
> Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> xen/blkback: unmap all persistent grants when frontend gets disconnected,
> the frontend may see massive amount of grant table warning when freeing
> resources.
> [   36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
> [   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
> 
> In this case, persistent grants would need to be disabled.
> 
> [Anchal Changelog: Removed timeout/request during blkfront freeze.
> Reworked the whole patch to work with blk-mq and incorporate upstream's
> comments]

Please tag versions using vX and it would be helpful if you could list
the specific changes that you performed between versions. There where
3 RFC versions IIRC, and there's no log of the changes between them.

> 
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> ---
>  drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++--
>  1 file changed, 115 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3b889ea950c2..464863ed7093 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -48,6 +48,8 @@
>  #include <linux/list.h>
>  #include <linux/workqueue.h>
>  #include <linux/sched/mm.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
>  
>  #include <xen/xen.h>
>  #include <xen/xenbus.h>
> @@ -80,6 +82,8 @@ enum blkif_state {
>  	BLKIF_STATE_DISCONNECTED,
>  	BLKIF_STATE_CONNECTED,
>  	BLKIF_STATE_SUSPENDED,
> +	BLKIF_STATE_FREEZING,
> +	BLKIF_STATE_FROZEN

Nit: adding a terminating ',' would prevent further additions from
having to modify this line.

>  };
>  
>  struct grant {
> @@ -219,6 +223,7 @@ struct blkfront_info
>  	struct list_head requests;
>  	struct bio_list bio_list;
>  	struct list_head info_list;
> +	struct completion wait_backend_disconnected;
>  };
>  
>  static unsigned int nr_minors;
> @@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
>  	info->sector_size = sector_size;
>  	info->physical_sector_size = physical_sector_size;
>  	blkif_set_queue_limits(info);
> +	init_completion(&info->wait_backend_disconnected);
>  
>  	return 0;
>  }
> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  		case XEN_SCSI_DISK5_MAJOR:
>  		case XEN_SCSI_DISK6_MAJOR:
>  		case XEN_SCSI_DISK7_MAJOR:
> -			*offset = (*minor / PARTS_PER_DISK) + 
> +			*offset = (*minor / PARTS_PER_DISK) +
>  				((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
>  				EMULATED_SD_DISK_NAME_OFFSET;
>  			*minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  		case XEN_SCSI_DISK13_MAJOR:
>  		case XEN_SCSI_DISK14_MAJOR:
>  		case XEN_SCSI_DISK15_MAJOR:
> -			*offset = (*minor / PARTS_PER_DISK) + 
> +			*offset = (*minor / PARTS_PER_DISK) +

Unrelated changes, please split to a pre-patch.

>  				((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
>  				EMULATED_SD_DISK_NAME_OFFSET;
>  			*minor = *minor +
> @@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  	unsigned int i;
>  	struct blkfront_ring_info *rinfo;
>  
> +	if (info->connected == BLKIF_STATE_FREEZING)
> +		goto free_rings;
>  	/* Prevent new requests being issued until we fix things up. */
>  	info->connected = suspend ?
>  		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
> @@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  	if (info->rq)
>  		blk_mq_stop_hw_queues(info->rq);
>  
> +free_rings:
>  	for_each_rinfo(info, rinfo, i)
>  		blkif_free_ring(rinfo);
>  
> @@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
>  	struct blkfront_info *info = rinfo->dev_info;
>  
> -	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> -		return IRQ_HANDLED;
> +	if (unlikely(info->connected != BLKIF_STATE_CONNECTED
> +		    && info->connected != BLKIF_STATE_FREEZING)){

Extra tab and missing space between '){'. Also my preference would be
for the && to go at the end of the previous line, like it's done
elsewhere in the file.

> +	    return IRQ_HANDLED;
> +	}
>  
>  	spin_lock_irqsave(&rinfo->ring_lock, flags);
>   again:
> @@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
>  	unsigned int segs;
>  	struct blkfront_ring_info *rinfo;
>  
> +	bool frozen = info->connected == BLKIF_STATE_FROZEN;

Please put this together with the rest of the variable definitions,
and leave the empty line as a split between variable definitions and
code. I've already requested this on RFC v3 but you seem to have
dropped some of the requests I've made there.

>  	blkfront_gather_backend_features(info);
>  	/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
>  	blkif_set_queue_limits(info);
> @@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
>  		kick_pending_request_queues(rinfo);
>  	}
>  
> +	if (frozen)
> +		return 0;
> +
>  	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
>  		/* Requeue pending requests (flush or discard) */
>  		list_del_init(&req->queuelist);
> @@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
>  
>  		return;
>  	case BLKIF_STATE_SUSPENDED:
> +	case BLKIF_STATE_FROZEN:
>  		/*
>  		 * If we are recovering from suspension, we need to wait
>  		 * for the backend to announce it's features before
> @@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
>  		break;
>  
>  	case XenbusStateClosed:
> -		if (dev->state == XenbusStateClosed)
> +		if (dev->state == XenbusStateClosed) {
> +			if (info->connected == BLKIF_STATE_FREEZING) {
> +				blkif_free(info, 0);
> +				info->connected = BLKIF_STATE_FROZEN;
> +				complete(&info->wait_backend_disconnected);
> +				break;

There's no need for the break here, you can rely on the break below.

> +			}
> +
>  			break;
> +		}
> +
> +		/*
> +		 * We may somehow receive backend's Closed again while thawing
> +		 * or restoring and it causes thawing or restoring to fail.
> +		 * Ignore such unexpected state regardless of the backend state.
> +		 */
> +		if (info->connected == BLKIF_STATE_FROZEN) {

I think you can join this with the previous dev->state == XenbusStateClosed?

Also, won't the device be in the Closed state already if it's in state
frozen?

> +			dev_dbg(&dev->dev,
> +					"ignore the backend's Closed state: %s",
> +					dev->nodename);
> +			break;
> +		}
>  		/* fall through */
>  	case XenbusStateClosing:
> -		if (info)
> -			blkfront_closing(info);
> +		if (info) {
> +			if (info->connected == BLKIF_STATE_FREEZING)
> +				xenbus_frontend_closed(dev);
> +			else
> +				blkfront_closing(info);
> +		}
>  		break;
>  	}
>  }
> @@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
>  	mutex_unlock(&blkfront_mutex);
>  }
>  
> +static int blkfront_freeze(struct xenbus_device *dev)
> +{
> +	unsigned int i;
> +	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> +	struct blkfront_ring_info *rinfo;
> +	/* This would be reasonable timeout as used in xenbus_dev_shutdown() */
> +	unsigned int timeout = 5 * HZ;
> +	unsigned long flags;
> +	int err = 0;
> +
> +	info->connected = BLKIF_STATE_FREEZING;
> +
> +	blk_mq_freeze_queue(info->rq);
> +	blk_mq_quiesce_queue(info->rq);
> +
> +	for_each_rinfo(info, rinfo, i) {
> +	    /* No more gnttab callback work. */
> +	    gnttab_cancel_free_callback(&rinfo->callback);
> +	    /* Flush gnttab callback work. Must be done with no locks held. */
> +	    flush_work(&rinfo->work);
> +	}
> +
> +	for_each_rinfo(info, rinfo, i) {
> +	    spin_lock_irqsave(&rinfo->ring_lock, flags);
> +	    if (RING_FULL(&rinfo->ring)
> +		    || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {

'||' should go at the end of the previous line.

> +		xenbus_dev_error(dev, err, "Hibernation Failed.
> +			The ring is still busy");
> +		info->connected = BLKIF_STATE_CONNECTED;
> +		spin_unlock_irqrestore(&rinfo->ring_lock, flags);

You need to unfreeze the queues here, or else the device will be in a
blocked state AFAICT.

> +		return -EBUSY;
> +	}
> +	    spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> +	}

This block has indentation all messed up.

> +	/* Kick the backend to disconnect */
> +	xenbus_switch_state(dev, XenbusStateClosing);
> +
> +	/*
> +	 * We don't want to move forward before the frontend is diconnected
> +	 * from the backend cleanly.
> +	 */
> +	timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
> +					      timeout);
> +	if (!timeout) {
> +		err = -EBUSY;

Note err is only used here, and I think could just be dropped.

> +		xenbus_dev_error(dev, err, "Freezing timed out;"
> +				 "the device may become inconsistent state");

Leaving the device in this state is quite bad, as it's in a closed
state and with the queues frozen. You should make an attempt to
restore things to a working state.

> +	}
> +
> +	return err;
> +}
> +
> +static int blkfront_restore(struct xenbus_device *dev)
> +{
> +	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> +	int err = 0;
> +
> +	err = talk_to_blkback(dev, info);
> +	blk_mq_unquiesce_queue(info->rq);
> +	blk_mq_unfreeze_queue(info->rq);
> +	if (!err)
> +	    blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);

Bad indentation. Also shouldn't you first update the queues and then
unfreeze them?

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/12] Fix PM hibernation in Xen guests
  2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
                   ` (11 preceding siblings ...)
  2020-05-19 23:29 ` [PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA Anchal Agarwal
@ 2020-05-28 17:59 ` Agarwal, Anchal
  12 siblings, 0 replies; 38+ messages in thread
From: Agarwal, Anchal @ 2020-05-28 17:59 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, boris.ostrovsky, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh

A gentle ping on this whole patch series.

Thanks,
Anchal

    Hello,
    This series fixes PM hibernation for hvm guests running on xen hypervisor.
    The running guest could now be hibernated and resumed successfully at a
    later time. The fixes for PM hibernation are added to block and
    network device drivers i.e xen-blkfront and xen-netfront. Any other driver
    that needs to add S4 support if not already, can follow same method of
    introducing freeze/thaw/restore callbacks.
    The patches had been tested against upstream kernel and xen4.11. Large
    scale testing is also done on Xen based Amazon EC2 instances. All this testing
    involved running memory exhausting workload in the background.

    Doing guest hibernation does not involve any support from hypervisor and
    this way guest has complete control over its state. Infrastructure
    restrictions for saving up guest state can be overcome by guest initiated
    hibernation.

    These patches were send out as RFC before and all the feedback had been
    incorporated in the patches. The last RFCV3 could be found here:
    https://lkml.org/lkml/2020/2/14/2789

    Known issues:
    1.KASLR causes intermittent hibernation failures. VM fails to resumes and
    has to be restarted. I will investigate this issue separately and shouldn't
    be a blocker for this patch series.
    2. During hibernation, I observed sometimes that freezing of tasks fails due
    to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
    out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
    may work. Also, this is a known issue with hibernation and some
    filesystems like XFS has been discussed by the community for years with not an
    effectve resolution at this point.

    Testing How to:
    ---------------
    1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
    xen-4.11]
    2. Bring up a HVM guest w/t kernel compiled with hibernation patches
    [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
    3. Create a swap file size=RAM size
    4. Update grub parameters and reboot
    5. Trigger pm-hibernation from within the VM

    Example:
    Set up a file-backed swap space. Swap file size>=Total memory on the system
    sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
    sudo chmod 600 /swap
    sudo mkswap /swap
    sudo swapon /swap

    Update resume device/resume offset in grub if using swap file:
    resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1

    Execute:
    --------
    sudo pm-hibernate
    OR
    echo disk > /sys/power/state && echo reboot > /sys/power/disk

    Compute resume offset code:
    "
    #!/usr/bin/env python
    import sys
    import array
    import fcntl

    #swap file
    f = open(sys.argv[1], 'r')
    buf = array.array('L', [0])

    #FIBMAP
    ret = fcntl.ioctl(f.fileno(), 0x01, buf)
    print buf[0]
    "


    Anchal Agarwal (5):
      x86/xen: Introduce new function to map HYPERVISOR_shared_info on
        Resume
      genirq: Shutdown irq chips in suspend/resume during hibernation
      xen: Introduce wrapper for save/restore sched clock offset
      xen: Update sched clock offset to avoid system instability in
        hibernation
      PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

    Munehisa Kamata (7):
      xen/manage: keep track of the on-going suspend mode
      xenbus: add freeze/thaw/restore callbacks support
      x86/xen: add system core suspend and resume callbacks
      xen-blkfront: add callbacks for PM suspend and hibernation
      xen-netfront: add callbacks for PM suspend and hibernation
      xen/time: introduce xen_{save,restore}_steal_clock
      x86/xen: save and restore steal clock

     arch/x86/xen/enlighten_hvm.c      |   8 ++
     arch/x86/xen/suspend.c            |  72 ++++++++++++++++++
     arch/x86/xen/time.c               |  18 ++++-
     arch/x86/xen/xen-ops.h            |   3 +
     drivers/block/xen-blkfront.c      | 122 ++++++++++++++++++++++++++++--
     drivers/net/xen-netfront.c        |  98 +++++++++++++++++++++++-
     drivers/xen/events/events_base.c  |   1 +
     drivers/xen/manage.c              |  73 ++++++++++++++++++
     drivers/xen/time.c                |  29 ++++++-
     drivers/xen/xenbus/xenbus_probe.c |  99 +++++++++++++++++++-----
     include/linux/irq.h               |   2 +
     include/xen/xen-ops.h             |   8 ++
     include/xen/xenbus.h              |   3 +
     kernel/irq/chip.c                 |   2 +-
     kernel/irq/internals.h            |   1 +
     kernel/irq/pm.c                   |  31 +++++---
     kernel/power/user.c               |   6 +-
     17 files changed, 536 insertions(+), 40 deletions(-)

    -- 
    2.24.1.AMZN



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/12] xen/manage: keep track of the on-going suspend mode
  2020-05-19 23:24 ` [PATCH 01/12] xen/manage: keep track of the on-going suspend mode Anchal Agarwal
@ 2020-05-30 22:26   ` Boris Ostrovsky
  2020-06-01 21:00     ` Agarwal, Anchal
  0 siblings, 1 reply; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 22:26 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:24 PM, Anchal Agarwal wrote:
>  
> +enum suspend_modes {
> +	NO_SUSPEND = 0,
> +	XEN_SUSPEND,
> +	PM_SUSPEND,
> +	PM_HIBERNATION,
> +};
> +
> +/* Protected by pm_mutex */
> +static enum suspend_modes suspend_mode = NO_SUSPEND;
> +
> +bool xen_suspend_mode_is_xen_suspend(void)
> +{
> +	return suspend_mode == XEN_SUSPEND;
> +}
> +
> +bool xen_suspend_mode_is_pm_suspend(void)
> +{
> +	return suspend_mode == PM_SUSPEND;
> +}
> +
> +bool xen_suspend_mode_is_pm_hibernation(void)
> +{
> +	return suspend_mode == PM_HIBERNATION;
> +}
> +


I don't see these last two used anywhere. Are you, in fact,
distinguishing between PM suspend and hibernation?


(I would also probably shorten the name a bit, perhaps
xen_is_pv/pm_suspend()?)


-boris




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support
  2020-05-19 23:25 ` [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support Anchal Agarwal
@ 2020-05-30 22:56   ` Boris Ostrovsky
  2020-06-01 23:36     ` Agarwal, Anchal
  0 siblings, 1 reply; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 22:56 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:25 PM, Anchal Agarwal wrote:
>  
>  int xenbus_dev_resume(struct device *dev)
>  {
> -	int err;
> +	int err = 0;


That's not necessary.


>  	struct xenbus_driver *drv;
>  	struct xenbus_device *xdev
>  		= container_of(dev, struct xenbus_device, dev);
> -
> +	bool xen_suspend = xen_suspend_mode_is_xen_suspend();
>  	DPRINTK("%s", xdev->nodename);
>  
>  	if (dev->driver == NULL)
> @@ -627,24 +645,32 @@ int xenbus_dev_resume(struct device *dev)
>  	drv = to_xenbus_driver(dev->driver);
>  	err = talk_to_otherend(xdev);
>  	if (err) {
> -		pr_warn("resume (talk_to_otherend) %s failed: %i\n",
> +		pr_warn("%s (talk_to_otherend) %s failed: %i\n",


Please use dev_warn() everywhere, we just had a bunch of patches that
replaced pr_warn(). In fact,  this is one of the lines that got changed.


>  
>  int xenbus_dev_cancel(struct device *dev)
>  {
> -	/* Do nothing */
> -	DPRINTK("cancel");
> +	int err = 0;


Again, no need to initialize.


> +	struct xenbus_driver *drv;
> +	struct xenbus_device *xdev
> +		= container_of(dev, struct xenbus_device, dev);


xendev please to be consistent with other code. And use to_xenbus_device().


-boris


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
  2020-05-19 23:25 ` [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume Anchal Agarwal
@ 2020-05-30 23:02   ` Boris Ostrovsky
  2020-06-04 23:03     ` Anchal Agarwal
  0 siblings, 1 reply; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 23:02 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:25 PM, Anchal Agarwal wrote:
> Introduce a small function which re-uses shared page's PA allocated
> during guest initialization time in reserve_shared_info() and not
> allocate new page during resume flow.
> It also  does the mapping of shared_info_page by calling
> xen_hvm_init_shared_info() to use the function.
>
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> ---
>  arch/x86/xen/enlighten_hvm.c | 7 +++++++
>  arch/x86/xen/xen-ops.h       | 1 +
>  2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index e138f7de52d2..75b1ec7a0fcd 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -27,6 +27,13 @@
>  
>  static unsigned long shared_info_pfn;
>  
> +void xen_hvm_map_shared_info(void)
> +{
> +	xen_hvm_init_shared_info();
> +	if (shared_info_pfn)
> +		HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> +}
> +


AFAICT it is only called once so I don't see a need for new routine.


And is it possible for shared_info_pfn to be NULL in resume path (which
is where this is called)?


-boris



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/12] x86/xen: add system core suspend and resume callbacks
  2020-05-19 23:26 ` [PATCH 04/12] x86/xen: add system core suspend and resume callbacks Anchal Agarwal
@ 2020-05-30 23:10   ` Boris Ostrovsky
  2020-06-03 22:40     ` Agarwal, Anchal
  0 siblings, 1 reply; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 23:10 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <kamatam@amazon.com>
>
> Add Xen PVHVM specific system core callbacks for PM suspend and
> hibernation support. The callbacks suspend and resume Xen
> primitives,like shared_info, pvclock and grant table. Note that
> Xen suspend can handle them in a different manner, but system
> core callbacks are called from the context.


I don't think I understand that last sentence.


>  So if the callbacks
> are called from Xen suspend context, return immediately.
>


> +
> +static int xen_syscore_suspend(void)
> +{
> +	struct xen_remove_from_physmap xrfp;
> +	int ret;
> +
> +	/* Xen suspend does similar stuffs in its own logic */
> +	if (xen_suspend_mode_is_xen_suspend())
> +		return 0;
> +
> +	xrfp.domid = DOMID_SELF;
> +	xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> +
> +	ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
> +	if (!ret)
> +		HYPERVISOR_shared_info = &xen_dummy_shared_info;
> +
> +	return ret;
> +}
> +
> +static void xen_syscore_resume(void)
> +{
> +	/* Xen suspend does similar stuffs in its own logic */
> +	if (xen_suspend_mode_is_xen_suspend())
> +		return;
> +
> +	/* No need to setup vcpu_info as it's already moved off */
> +	xen_hvm_map_shared_info();
> +
> +	pvclock_resume();
> +
> +	gnttab_resume();


Do you call gnttab_suspend() in pm suspend path?


> +}
> +
> +/*
> + * These callbacks will be called with interrupts disabled and when having only
> + * one CPU online.
> + */
> +static struct syscore_ops xen_hvm_syscore_ops = {
> +	.suspend = xen_syscore_suspend,
> +	.resume = xen_syscore_resume
> +};
> +
> +void __init xen_setup_syscore_ops(void)
> +{
> +	if (xen_hvm_domain())


Have you tested this (the whole feature, not just this patch) with PVH
guest BTW? And PVH dom0 for that matter?


-boris


> +		register_syscore_ops(&xen_hvm_syscore_ops);
> +}




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation
  2020-05-19 23:26 ` [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation Anchal Agarwal
  2020-05-19 23:29   ` Singh, Balbir
  2020-05-19 23:34   ` Anchal Agarwal
@ 2020-05-30 23:17   ` Boris Ostrovsky
  2020-06-01 20:46     ` Agarwal, Anchal
  2 siblings, 1 reply; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 23:17 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> Many legacy device drivers do not implement power management (PM)
> functions which means that interrupts requested by these drivers stay
> in active state when the kernel is hibernated.
>
> This does not matter on bare metal and on most hypervisors because the
> interrupt is restored on resume without any noticable side effects as
> it stays connected to the same physical or virtual interrupt line.
>
> The XEN interrupt mechanism is different as it maintains a mapping
> between the Linux interrupt number and a XEN event channel. If the
> interrupt stays active on hibernation this mapping is preserved but
> there is unfortunately no guarantee that on resume the same event
> channels are reassigned to these devices. This can result in event
> channel conflicts which prevent the affected devices from being
> restored correctly.
>
> One way to solve this would be to add the necessary power management
> functions to all affected legacy device drivers, but that's a
> questionable effort which does not provide any benefits on non-XEN
> environments.
>
> The least intrusive and most efficient solution is to provide a
> mechanism which allows the core interrupt code to tear down these
> interrupts on hibernation and bring them back up again on resume. This
> allows the XEN event channel mechanism to assign an arbitrary event
> channel on resume without affecting the functionality of these
> devices.
>
> Fortunately all these device interrupts are handled by a dedicated XEN
> interrupt chip so the chip can be marked that all interrupts connected
> to it are handled this way. This is pretty much in line with the other
> interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.
>
> Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
> it the core interrupt suspend/resume paths.
>
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> Signed-off--by: Thomas Gleixner <tglx@linutronix.de>


Since Thomas wrote this patch I think it should also have "From: " him.


-boris



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 08/12] xen/time: introduce xen_{save,restore}_steal_clock
  2020-05-19 23:28 ` [PATCH 08/12] xen/time: introduce xen_{save,restore}_steal_clock Anchal Agarwal
@ 2020-05-30 23:32   ` Boris Ostrovsky
  0 siblings, 0 replies; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 23:32 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:28 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <kamatam@amazon.com>
>
> Currently, steal time accounting code in scheduler expects steal clock
> callback to provide monotonically increasing value. If the accounting
> code receives a smaller value than previous one, it uses a negative
> value to calculate steal time and results in incorrectly updated idle
> and steal time accounting. This breaks userspace tools which read
> /proc/stat.
>
> top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
> Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 0.0%si,-1253874204672.0%st
>
> This can actually happen when a Xen PVHVM guest gets restored from
> hibernation, because such a restored guest is just a fresh domain from
> Xen perspective and the time information in runstate info starts over
> from scratch.
>
> This patch introduces xen_save_steal_clock() which saves current values
> in runstate info into per-cpu variables. Its couterpart,
> xen_restore_steal_clock(), sets offset if it found the current values in
> runstate info are smaller than previous ones. xen_steal_clock() is also
> modified to use the offset to ensure that scheduler only sees
> monotonically increasing number.
>
> Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> ---
>  drivers/xen/time.c    | 29 ++++++++++++++++++++++++++++-
>  include/xen/xen-ops.h |  2 ++
>  2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
> index 0968859c29d0..3560222cc0dd 100644
> --- a/drivers/xen/time.c
> +++ b/drivers/xen/time.c
> @@ -23,6 +23,9 @@ static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>  
>  static DEFINE_PER_CPU(u64[4], old_runstate_time);
>  
> +static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
> +static DEFINE_PER_CPU(u64, xen_steal_clock_offset);


Can you use old_runstate_time here? It is used to solve a similar
problem for pv suspend, isn't it?


-boris





^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 09/12] x86/xen: save and restore steal clock
  2020-05-19 23:28 ` [PATCH 09/12] x86/xen: save and restore steal clock Anchal Agarwal
@ 2020-05-30 23:44   ` Boris Ostrovsky
  2020-06-04 18:33     ` Anchal Agarwal
  0 siblings, 1 reply; 38+ messages in thread
From: Boris Ostrovsky @ 2020-05-30 23:44 UTC (permalink / raw)
  To: Anchal Agarwal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, kamatam, sstabellini, konrad.wilk, roger.pau, axboe,
	davem, rjw, len.brown, pavel, peterz, eduval, sblbir, xen-devel,
	vkuznets, netdev, linux-kernel, dwmw, benh

On 5/19/20 7:28 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <kamatam@amazon.com>
>
> Save steal clock values of all present CPUs in the system core ops
> suspend callbacks. Also, restore a boot CPU's steal clock in the system
> core resume callback. For non-boot CPUs, restore after they're brought
> up, because runstate info for non-boot CPUs are not active until then.
>
> Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> ---
>  arch/x86/xen/suspend.c | 13 ++++++++++++-
>  arch/x86/xen/time.c    |  3 +++
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
> index 784c4484100b..dae0f74f5390 100644
> --- a/arch/x86/xen/suspend.c
> +++ b/arch/x86/xen/suspend.c
> @@ -91,12 +91,20 @@ void xen_arch_suspend(void)
>  static int xen_syscore_suspend(void)
>  {
>  	struct xen_remove_from_physmap xrfp;
> -	int ret;
> +	int cpu, ret;
>  
>  	/* Xen suspend does similar stuffs in its own logic */
>  	if (xen_suspend_mode_is_xen_suspend())
>  		return 0;
>  
> +	for_each_present_cpu(cpu) {
> +		/*
> +		 * Nonboot CPUs are already offline, but the last copy of
> +		 * runstate info is still accessible.
> +		 */
> +		xen_save_steal_clock(cpu);
> +	}
> +
>  	xrfp.domid = DOMID_SELF;
>  	xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
>  
> @@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
>  
>  	pvclock_resume();


Doesn't make any difference but I think since this patch is where you
are dealing with clock then pvclock_resume() should be added here and
not in the earlier patch.


-boris


>  
> +	/* Nonboot CPUs will be resumed when they're brought up */
> +	xen_restore_steal_clock(smp_processor_id());
> +
>  	gnttab_resume();
>  }
>  
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index c8897aad13cd..33d754564b09 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -545,6 +545,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
>  {
>  	int cpu = smp_processor_id();
>  	xen_setup_runstate_info(cpu);
> +	if (cpu)
> +		xen_restore_steal_clock(cpu);
> +
>  	/*
>  	 * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
>  	 * doing it xen_hvm_cpu_notify (which gets called by smp_init during




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation
  2020-05-30 23:17   ` Boris Ostrovsky
@ 2020-06-01 20:46     ` Agarwal, Anchal
  0 siblings, 0 replies; 38+ messages in thread
From: Agarwal, Anchal @ 2020-06-01 20:46 UTC (permalink / raw)
  To: Boris Ostrovsky, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh


    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    On 5/19/20 7:26 PM, Anchal Agarwal wrote:
    > Many legacy device drivers do not implement power management (PM)
    > functions which means that interrupts requested by these drivers stay
    > in active state when the kernel is hibernated.
    >
    > This does not matter on bare metal and on most hypervisors because the
    > interrupt is restored on resume without any noticable side effects as
    > it stays connected to the same physical or virtual interrupt line.
    >
    > The XEN interrupt mechanism is different as it maintains a mapping
    > between the Linux interrupt number and a XEN event channel. If the
    > interrupt stays active on hibernation this mapping is preserved but
    > there is unfortunately no guarantee that on resume the same event
    > channels are reassigned to these devices. This can result in event
    > channel conflicts which prevent the affected devices from being
    > restored correctly.
    >
    > One way to solve this would be to add the necessary power management
    > functions to all affected legacy device drivers, but that's a
    > questionable effort which does not provide any benefits on non-XEN
    > environments.
    >
    > The least intrusive and most efficient solution is to provide a
    > mechanism which allows the core interrupt code to tear down these
    > interrupts on hibernation and bring them back up again on resume. This
    > allows the XEN event channel mechanism to assign an arbitrary event
    > channel on resume without affecting the functionality of these
    > devices.
    >
    > Fortunately all these device interrupts are handled by a dedicated XEN
    > interrupt chip so the chip can be marked that all interrupts connected
    > to it are handled this way. This is pretty much in line with the other
    > interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.
    >
    > Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
    > it the core interrupt suspend/resume paths.
    >
    > Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
    > Signed-off--by: Thomas Gleixner <tglx@linutronix.de>


    Since Thomas wrote this patch I think it should also have "From: " him.

That sounds about right. I will update it next round and add Tested-by.

    -boris

- Anchal




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/12] xen/manage: keep track of the on-going suspend mode
  2020-05-30 22:26   ` Boris Ostrovsky
@ 2020-06-01 21:00     ` Agarwal, Anchal
  2020-06-01 22:39       ` Boris Ostrovsky
  0 siblings, 1 reply; 38+ messages in thread
From: Agarwal, Anchal @ 2020-06-01 21:00 UTC (permalink / raw)
  To: Boris Ostrovsky, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh, Agarwal, Anchal


    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    On 5/19/20 7:24 PM, Anchal Agarwal wrote:
    >
    > +enum suspend_modes {
    > +     NO_SUSPEND = 0,
    > +     XEN_SUSPEND,
    > +     PM_SUSPEND,
    > +     PM_HIBERNATION,
    > +};
    > +
    > +/* Protected by pm_mutex */
    > +static enum suspend_modes suspend_mode = NO_SUSPEND;
    > +
    > +bool xen_suspend_mode_is_xen_suspend(void)
    > +{
    > +     return suspend_mode == XEN_SUSPEND;
    > +}
    > +
    > +bool xen_suspend_mode_is_pm_suspend(void)
    > +{
    > +     return suspend_mode == PM_SUSPEND;
    > +}
    > +
    > +bool xen_suspend_mode_is_pm_hibernation(void)
    > +{
    > +     return suspend_mode == PM_HIBERNATION;
    > +}
    > +


    I don't see these last two used anywhere. Are you, in fact,
    distinguishing between PM suspend and hibernation?

Yes, I am. Unless there is a better way to distinguish at runtime which I haven't figured out yet.
The initial design was to have separate states for separate modes. Currently, PM_HIBERNATION is handled 
by !xen_suspend . However, if any case arises where we need to set the suspend_mode, its available via 
this interface. This is basically to support PM* ops via ACPI path. Since, PM_SUSPEND is not handled by the series
the code piece can be removed and added later. Any comments?


    (I would also probably shorten the name a bit, perhaps
    xen_is_pv/pm_suspend()?)

Sure. Will fix in my next round of post.
    -boris

Thanks,
Anchal





^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/12] xen/manage: keep track of the on-going suspend mode
  2020-06-01 21:00     ` Agarwal, Anchal
@ 2020-06-01 22:39       ` Boris Ostrovsky
  0 siblings, 0 replies; 38+ messages in thread
From: Boris Ostrovsky @ 2020-06-01 22:39 UTC (permalink / raw)
  To: Agarwal, Anchal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh

On 6/1/20 5:00 PM, Agarwal, Anchal wrote:
>    
>
>     I don't see these last two used anywhere. Are you, in fact,
>     distinguishing between PM suspend and hibernation?
>
> Yes, I am. Unless there is a better way to distinguish at runtime which I haven't figured out yet.
> The initial design was to have separate states for separate modes. Currently, PM_HIBERNATION is handled 
> by !xen_suspend . However, if any case arises where we need to set the suspend_mode, its available via 
> this interface. This is basically to support PM* ops via ACPI path. Since, PM_SUSPEND is not handled by the series
> the code piece can be removed and added later. Any comments?


Yes, if this is not being handled then I don't see any reason for this
code to be there.


-boris


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support
  2020-05-30 22:56   ` Boris Ostrovsky
@ 2020-06-01 23:36     ` Agarwal, Anchal
  0 siblings, 0 replies; 38+ messages in thread
From: Agarwal, Anchal @ 2020-06-01 23:36 UTC (permalink / raw)
  To: Boris Ostrovsky, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh



    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    On 5/19/20 7:25 PM, Anchal Agarwal wrote:
    >
    >  int xenbus_dev_resume(struct device *dev)
    >  {
    > -     int err;
    > +     int err = 0;


    That's not necessary.
ACK.

    >       struct xenbus_driver *drv;
    >       struct xenbus_device *xdev
    >               = container_of(dev, struct xenbus_device, dev);
    > -
    > +     bool xen_suspend = xen_suspend_mode_is_xen_suspend();
    >       DPRINTK("%s", xdev->nodename);
    >
    >       if (dev->driver == NULL)
    > @@ -627,24 +645,32 @@ int xenbus_dev_resume(struct device *dev)
    >       drv = to_xenbus_driver(dev->driver);
    >       err = talk_to_otherend(xdev);
    >       if (err) {
    > -             pr_warn("resume (talk_to_otherend) %s failed: %i\n",
    > +             pr_warn("%s (talk_to_otherend) %s failed: %i\n",


    Please use dev_warn() everywhere, we just had a bunch of patches that
    replaced pr_warn(). In fact,  this is one of the lines that got changed.

ACK. Will send fixes in next series

    >
    >  int xenbus_dev_cancel(struct device *dev)
    >  {
    > -     /* Do nothing */
    > -     DPRINTK("cancel");
    > +     int err = 0;


    Again, no need to initialize.

ACK.
    > +     struct xenbus_driver *drv;
    > +     struct xenbus_device *xdev
    > +             = container_of(dev, struct xenbus_device, dev);


    xendev please to be consistent with other code. And use to_xenbus_device().
ACK.

    -boris

I will put the fixes in next round of patches.

Thanks,
Anchal



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/12] x86/xen: add system core suspend and resume callbacks
  2020-05-30 23:10   ` Boris Ostrovsky
@ 2020-06-03 22:40     ` Agarwal, Anchal
  2020-06-05 21:24       ` Boris Ostrovsky
  0 siblings, 1 reply; 38+ messages in thread
From: Agarwal, Anchal @ 2020-06-03 22:40 UTC (permalink / raw)
  To: Boris Ostrovsky, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh, Agarwal, Anchal

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    On 5/19/20 7:26 PM, Anchal Agarwal wrote:
    > From: Munehisa Kamata <kamatam@amazon.com>
    >
    > Add Xen PVHVM specific system core callbacks for PM suspend and
    > hibernation support. The callbacks suspend and resume Xen
    > primitives,like shared_info, pvclock and grant table. Note that
    > Xen suspend can handle them in a different manner, but system
    > core callbacks are called from the context.


    I don't think I understand that last sentence.

Looks like it may have cryptic meaning of stating that xen_suspend calls syscore_suspend from xen_suspend
So, if these syscore ops gets called  during xen_suspend do not do anything. Check if the mode is in xen suspend 
and return from there. These syscore_ops are specifically for domU hibernation.
I must admit, I may have overlooked lack of explanation of some implicit details in the original commit msg. 

    >  So if the callbacks
    > are called from Xen suspend context, return immediately.
    >


    > +
    > +static int xen_syscore_suspend(void)
    > +{
    > +     struct xen_remove_from_physmap xrfp;
    > +     int ret;
    > +
    > +     /* Xen suspend does similar stuffs in its own logic */
    > +     if (xen_suspend_mode_is_xen_suspend())
    > +             return 0;
    > +
    > +     xrfp.domid = DOMID_SELF;
    > +     xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
    > +
    > +     ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
    > +     if (!ret)
    > +             HYPERVISOR_shared_info = &xen_dummy_shared_info;
    > +
    > +     return ret;
    > +}
    > +
    > +static void xen_syscore_resume(void)
    > +{
    > +     /* Xen suspend does similar stuffs in its own logic */
    > +     if (xen_suspend_mode_is_xen_suspend())
    > +             return;
    > +
    > +     /* No need to setup vcpu_info as it's already moved off */
    > +     xen_hvm_map_shared_info();
    > +
    > +     pvclock_resume();
    > +
    > +     gnttab_resume();


    Do you call gnttab_suspend() in pm suspend path?
No, since it does nothing for HVM guests. The unmap_frames is only applicable for PV guests right?

    > +}
    > +
    > +/*
    > + * These callbacks will be called with interrupts disabled and when having only
    > + * one CPU online.
    > + */
    > +static struct syscore_ops xen_hvm_syscore_ops = {
    > +     .suspend = xen_syscore_suspend,
    > +     .resume = xen_syscore_resume
    > +};
    > +
    > +void __init xen_setup_syscore_ops(void)
    > +{
    > +     if (xen_hvm_domain())


    Have you tested this (the whole feature, not just this patch) with PVH
    guest BTW? And PVH dom0 for that matter?

No I haven't. The whole series is just tested with hvm/pvhvm guests.

    -boris
Thanks,
Anchal

    > +             register_syscore_ops(&xen_hvm_syscore_ops);
    > +}





^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 09/12] x86/xen: save and restore steal clock
  2020-05-30 23:44   ` Boris Ostrovsky
@ 2020-06-04 18:33     ` Anchal Agarwal
  0 siblings, 0 replies; 38+ messages in thread
From: Anchal Agarwal @ 2020-06-04 18:33 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tglx, mingo, bp, hpa, x86, jgross, linux-pm, linux-mm, kamatam,
	sstabellini, konrad.wilk, roger.pau, axboe, davem, rjw,
	len.brown, pavel, peterz, eduval, sblbir, xen-devel, vkuznets,
	netdev, linux-kernel, dwmw, benh, anchalag

On Sat, May 30, 2020 at 07:44:06PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 5/19/20 7:28 PM, Anchal Agarwal wrote:
> > From: Munehisa Kamata <kamatam@amazon.com>
> >
> > Save steal clock values of all present CPUs in the system core ops
> > suspend callbacks. Also, restore a boot CPU's steal clock in the system
> > core resume callback. For non-boot CPUs, restore after they're brought
> > up, because runstate info for non-boot CPUs are not active until then.
> >
> > Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> > Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> > ---
> >  arch/x86/xen/suspend.c | 13 ++++++++++++-
> >  arch/x86/xen/time.c    |  3 +++
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
> > index 784c4484100b..dae0f74f5390 100644
> > --- a/arch/x86/xen/suspend.c
> > +++ b/arch/x86/xen/suspend.c
> > @@ -91,12 +91,20 @@ void xen_arch_suspend(void)
> >  static int xen_syscore_suspend(void)
> >  {
> >       struct xen_remove_from_physmap xrfp;
> > -     int ret;
> > +     int cpu, ret;
> >
> >       /* Xen suspend does similar stuffs in its own logic */
> >       if (xen_suspend_mode_is_xen_suspend())
> >               return 0;
> >
> > +     for_each_present_cpu(cpu) {
> > +             /*
> > +              * Nonboot CPUs are already offline, but the last copy of
> > +              * runstate info is still accessible.
> > +              */
> > +             xen_save_steal_clock(cpu);
> > +     }
> > +
> >       xrfp.domid = DOMID_SELF;
> >       xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> >
> > @@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
> >
> >       pvclock_resume();
> 
> 
> Doesn't make any difference but I think since this patch is where you
> are dealing with clock then pvclock_resume() should be added here and
> not in the earlier patch.
> 
> 
> -boris
I think the reason it may be in previous patch because it was a part
of syscore_resume and steal clock fix came in later. 
It could me moved to this patch that deals with all clock stuff.

-Anchal
> 
> 

> >
> > +     /* Nonboot CPUs will be resumed when they're brought up */
> > +     xen_restore_steal_clock(smp_processor_id());
> > +
> >       gnttab_resume();
> >  }
> >
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index c8897aad13cd..33d754564b09 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -545,6 +545,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
> >  {
> >       int cpu = smp_processor_id();
> >       xen_setup_runstate_info(cpu);
> > +     if (cpu)
> > +             xen_restore_steal_clock(cpu);
> > +
> >       /*
> >        * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
> >        * doing it xen_hvm_cpu_notify (which gets called by smp_init during
> 
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
  2020-05-30 23:02   ` Boris Ostrovsky
@ 2020-06-04 23:03     ` Anchal Agarwal
  2020-06-05 21:39       ` Boris Ostrovsky
  0 siblings, 1 reply; 38+ messages in thread
From: Anchal Agarwal @ 2020-06-04 23:03 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tglx, mingo, bp, hpa, x86, jgross, linux-pm, linux-mm, kamatam,
	sstabellini, konrad.wilk, roger.pau, axboe, davem, rjw,
	len.brown, pavel, peterz, eduval, sblbir, xen-devel, vkuznets,
	netdev, linux-kernel, dwmw, benh

On Sat, May 30, 2020 at 07:02:01PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 5/19/20 7:25 PM, Anchal Agarwal wrote:
> > Introduce a small function which re-uses shared page's PA allocated
> > during guest initialization time in reserve_shared_info() and not
> > allocate new page during resume flow.
> > It also  does the mapping of shared_info_page by calling
> > xen_hvm_init_shared_info() to use the function.
> >
> > Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> > ---
> >  arch/x86/xen/enlighten_hvm.c | 7 +++++++
> >  arch/x86/xen/xen-ops.h       | 1 +
> >  2 files changed, 8 insertions(+)
> >
> > diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> > index e138f7de52d2..75b1ec7a0fcd 100644
> > --- a/arch/x86/xen/enlighten_hvm.c
> > +++ b/arch/x86/xen/enlighten_hvm.c
> > @@ -27,6 +27,13 @@
> >
> >  static unsigned long shared_info_pfn;
> >
> > +void xen_hvm_map_shared_info(void)
> > +{
> > +     xen_hvm_init_shared_info();
> > +     if (shared_info_pfn)
> > +             HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> > +}
> > +
> 
> 
> AFAICT it is only called once so I don't see a need for new routine.
> 
> 
HYPERVISOR_shared_info can only be mapped in this scope without refactoring
much of the code.
> And is it possible for shared_info_pfn to be NULL in resume path (which
> is where this is called)?
> 
> 
I don't think it should be, still a sanity check but I don't think its needed there
because hibernation will fail in any case if thats the case. 
However, HYPERVISOR_shared_info does needs to be re-mapped on resume as its been
marked to dummy address on suspend. Its also safe in case va changes.
Does the answer your question?
> -boris

-Anchal
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/12] x86/xen: add system core suspend and resume callbacks
  2020-06-03 22:40     ` Agarwal, Anchal
@ 2020-06-05 21:24       ` Boris Ostrovsky
  0 siblings, 0 replies; 38+ messages in thread
From: Boris Ostrovsky @ 2020-06-05 21:24 UTC (permalink / raw)
  To: Agarwal, Anchal, tglx, mingo, bp, hpa, x86, jgross, linux-pm,
	linux-mm, Kamata, Munehisa, sstabellini, konrad.wilk, roger.pau,
	axboe, davem, rjw, len.brown, pavel, peterz, Valentin, Eduardo,
	Singh, Balbir, xen-devel, vkuznets, netdev, linux-kernel,
	Woodhouse, David, benh

On 6/3/20 6:40 PM, Agarwal, Anchal wrote:
>     CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
>     On 5/19/20 7:26 PM, Anchal Agarwal wrote:
>     > From: Munehisa Kamata <kamatam@amazon.com>
>     >
>     > Add Xen PVHVM specific system core callbacks for PM suspend and
>     > hibernation support. The callbacks suspend and resume Xen
>     > primitives,like shared_info, pvclock and grant table. Note that
>     > Xen suspend can handle them in a different manner, but system
>     > core callbacks are called from the context.
>
>
>     I don't think I understand that last sentence.
>
> Looks like it may have cryptic meaning of stating that xen_suspend calls syscore_suspend from xen_suspend
> So, if these syscore ops gets called  during xen_suspend do not do anything. Check if the mode is in xen suspend 
> and return from there. These syscore_ops are specifically for domU hibernation.
> I must admit, I may have overlooked lack of explanation of some implicit details in the original commit msg. 
>
>     >  So if the callbacks
>     > are called from Xen suspend context, return immediately.
>     >
>
>
>     > +
>     > +static int xen_syscore_suspend(void)
>     > +{
>     > +     struct xen_remove_from_physmap xrfp;
>     > +     int ret;
>     > +
>     > +     /* Xen suspend does similar stuffs in its own logic */
>     > +     if (xen_suspend_mode_is_xen_suspend())
>     > +             return 0;


With your explanation now making this clearer, is this check really
necessary? From what I see we are in XEN_SUSPEND mode when
lock_system_sleep() lock is taken, meaning that we can't initialize
hibernation.


>     > +
>     > +     xrfp.domid = DOMID_SELF;
>     > +     xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
>     > +
>     > +     ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
>     > +     if (!ret)
>     > +             HYPERVISOR_shared_info = &xen_dummy_shared_info;
>     > +
>     > +     return ret;
>     > +}
>     > +
>     > +static void xen_syscore_resume(void)
>     > +{
>     > +     /* Xen suspend does similar stuffs in its own logic */
>     > +     if (xen_suspend_mode_is_xen_suspend())
>     > +             return;
>     > +
>     > +     /* No need to setup vcpu_info as it's already moved off */
>     > +     xen_hvm_map_shared_info();
>     > +
>     > +     pvclock_resume();
>     > +
>     > +     gnttab_resume();
>
>
>     Do you call gnttab_suspend() in pm suspend path?
> No, since it does nothing for HVM guests. The unmap_frames is only applicable for PV guests right?


You should call it nevertheless. It will decide whether or not anything
needs to be done.


-boris



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
  2020-06-04 23:03     ` Anchal Agarwal
@ 2020-06-05 21:39       ` Boris Ostrovsky
  0 siblings, 0 replies; 38+ messages in thread
From: Boris Ostrovsky @ 2020-06-05 21:39 UTC (permalink / raw)
  To: Anchal Agarwal
  Cc: tglx, mingo, bp, hpa, x86, jgross, linux-pm, linux-mm, kamatam,
	sstabellini, konrad.wilk, roger.pau, axboe, davem, rjw,
	len.brown, pavel, peterz, eduval, sblbir, xen-devel, vkuznets,
	netdev, linux-kernel, dwmw, benh

On 6/4/20 7:03 PM, Anchal Agarwal wrote:
> On Sat, May 30, 2020 at 07:02:01PM -0400, Boris Ostrovsky wrote:
>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> On 5/19/20 7:25 PM, Anchal Agarwal wrote:
>>> Introduce a small function which re-uses shared page's PA allocated
>>> during guest initialization time in reserve_shared_info() and not
>>> allocate new page during resume flow.
>>> It also  does the mapping of shared_info_page by calling
>>> xen_hvm_init_shared_info() to use the function.
>>>
>>> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
>>> ---
>>>  arch/x86/xen/enlighten_hvm.c | 7 +++++++
>>>  arch/x86/xen/xen-ops.h       | 1 +
>>>  2 files changed, 8 insertions(+)
>>>
>>> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
>>> index e138f7de52d2..75b1ec7a0fcd 100644
>>> --- a/arch/x86/xen/enlighten_hvm.c
>>> +++ b/arch/x86/xen/enlighten_hvm.c
>>> @@ -27,6 +27,13 @@
>>>
>>>  static unsigned long shared_info_pfn;
>>>
>>> +void xen_hvm_map_shared_info(void)
>>> +{
>>> +     xen_hvm_init_shared_info();
>>> +     if (shared_info_pfn)
>>> +             HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
>>> +}
>>> +
>>
>> AFAICT it is only called once so I don't see a need for new routine.
>>
>>
> HYPERVISOR_shared_info can only be mapped in this scope without refactoring
> much of the code.


Refactoring what? All am suggesting is

--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -124,7 +124,9 @@ static void xen_syscore_resume(void)
                return;
 
        /* No need to setup vcpu_info as it's already moved off */
-       xen_hvm_map_shared_info();
+       xen_hvm_init_shared_info();
+       if (shared_info_pfn)
+               HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
 
        pvclock_resume();

>> And is it possible for shared_info_pfn to be NULL in resume path (which
>> is where this is called)?
>>
>>
> I don't think it should be, still a sanity check but I don't think its needed there
> because hibernation will fail in any case if thats the case. 


If shared_info_pfn is NULL you'd have problems long before hibernation
started. We set it in xen_hvm_guest_init() and never touch again.


In fact, I'd argue that it should be __ro_after_init.


> However, HYPERVISOR_shared_info does needs to be re-mapped on resume as its been
> marked to dummy address on suspend. Its also safe in case va changes.
> Does the answer your question?


I wasn't arguing whether HYPERVISOR_shared_info needs to be set, I was
only saying that shared_info_pfn doesn't need to be tested.


-boris



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, back to index

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 23:24 [PATCH 00/12] Fix PM hibernation in Xen guests Anchal Agarwal
2020-05-19 23:24 ` [PATCH 01/12] xen/manage: keep track of the on-going suspend mode Anchal Agarwal
2020-05-30 22:26   ` Boris Ostrovsky
2020-06-01 21:00     ` Agarwal, Anchal
2020-06-01 22:39       ` Boris Ostrovsky
2020-05-19 23:25 ` [PATCH 02/12] xenbus: add freeze/thaw/restore callbacks support Anchal Agarwal
2020-05-30 22:56   ` Boris Ostrovsky
2020-06-01 23:36     ` Agarwal, Anchal
2020-05-19 23:25 ` [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume Anchal Agarwal
2020-05-30 23:02   ` Boris Ostrovsky
2020-06-04 23:03     ` Anchal Agarwal
2020-06-05 21:39       ` Boris Ostrovsky
2020-05-19 23:26 ` [PATCH 04/12] x86/xen: add system core suspend and resume callbacks Anchal Agarwal
2020-05-30 23:10   ` Boris Ostrovsky
2020-06-03 22:40     ` Agarwal, Anchal
2020-06-05 21:24       ` Boris Ostrovsky
2020-05-19 23:26 ` [PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation Anchal Agarwal
2020-05-19 23:29   ` Singh, Balbir
2020-05-19 23:36     ` Agarwal, Anchal
2020-05-19 23:34   ` Anchal Agarwal
2020-05-30 23:17   ` Boris Ostrovsky
2020-06-01 20:46     ` Agarwal, Anchal
2020-05-19 23:27 ` [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation Anchal Agarwal
2020-05-20  5:00   ` kbuild test robot
2020-05-20  5:07   ` kbuild test robot
2020-05-21 23:48   ` Anchal Agarwal
2020-05-22  1:43     ` Singh, Balbir
2020-05-28 12:30   ` Roger Pau Monné
2020-05-19 23:28 ` [PATCH 07/12] xen-netfront: " Anchal Agarwal
2020-05-19 23:28 ` [PATCH 08/12] xen/time: introduce xen_{save,restore}_steal_clock Anchal Agarwal
2020-05-30 23:32   ` Boris Ostrovsky
2020-05-19 23:28 ` [PATCH 09/12] x86/xen: save and restore steal clock Anchal Agarwal
2020-05-30 23:44   ` Boris Ostrovsky
2020-06-04 18:33     ` Anchal Agarwal
2020-05-19 23:29 ` [PATCH 10/12] xen: Introduce wrapper for save/restore sched clock offset Anchal Agarwal
2020-05-19 23:29 ` [PATCH 11/12] xen: Update sched clock offset to avoid system instability in hibernation Anchal Agarwal
2020-05-19 23:29 ` [PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA Anchal Agarwal
2020-05-28 17:59 ` [PATCH 00/12] Fix PM hibernation in Xen guests Agarwal, Anchal

Linux-PM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pm/0 linux-pm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pm linux-pm/ https://lore.kernel.org/linux-pm \
		linux-pm@vger.kernel.org
	public-inbox-index linux-pm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git