* [PATCH v7 0/4] resource: Use list_head to link sibling resource
@ 2018-07-18  2:49 Baoquan He
  2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: brijesh.singh, devicetree, airlied, linux-pci, richard.weiyang,
	jcmvbkbc, baiyaowei, kys, frowand.list, lorenzo.pieralisi,
	sthemmin, Baoquan He, linux-nvdimm, patrik.r.jakobsson,
	linux-input, gustavo, dyoung, thomas.lendacky, haiyangz,
	maarten.lankhorst, jglisse, seanpaul, bhelgaas, tglx, yinghai,
	jonathan.derrick, chris, monstr, linux-parisc, gregkh,
	dmitry.torokhov, ebiederm, devel, linuxppc-dev, davem

This patchset does the following:
1) Move reparent_resources() to kernel/resource.c to clean up the
   duplicated code in arch/microblaze/pci/pci-common.c and
   arch/powerpc/kernel/pci-common.c.
2) Replace struct resource's singly linked sibling list with
   list_head, clearing out the pointer operations of the singly
   linked list for better code readability.
3) Based on the list_head replacement, add a new function,
   walk_system_ram_res_rev(), which does reversed iteration over
   iomem_resource's siblings.
4) Change kexec_file loading to search system RAM top-down for
   kernel loading, using walk_system_ram_res_rev().

Note:
This patchset has only been tested on the x86_64 arch with networking
enabled. One thing to pay attention to is that a root resource's child
member needs to be initialized explicitly: with LIST_HEAD_INIT() if
statically defined, or with INIT_LIST_HEAD() if dynamically defined,
just as is done for iomem_resource/ioport_resource, or in the change
to get_pci_domain_busn_res(). A minimal sketch of both styles follows.
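
As a minimal sketch (the resource name and variable names here are
illustrative, not taken from this series), the two initialization
styles look like this:

	/* statically defined root resource: initialize at build time */
	static struct resource my_root_res = {
		.name    = "my root",
		.start   = 0,
		.end     = -1,
		.flags   = IORESOURCE_MEM,
		.sibling = LIST_HEAD_INIT(my_root_res.sibling),
		.child   = LIST_HEAD_INIT(my_root_res.child),
	};

	/* dynamically allocated root resource: initialize at run time */
	struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);

	if (res) {
		res->name  = "my root";
		res->flags = IORESOURCE_MEM;
		INIT_LIST_HEAD(&res->child);
		INIT_LIST_HEAD(&res->sibling);
	}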

v6:
http://lkml.kernel.org/r/20180704041038.8190-1-bhe@redhat.com

v5:
http://lkml.kernel.org/r/20180612032831.29747-1-bhe@redhat.com

v4:
http://lkml.kernel.org/r/20180507063224.24229-1-bhe@redhat.com

v3:
http://lkml.kernel.org/r/20180419001848.3041-1-bhe@redhat.com

v2:
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com

v1:
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Changelog:
v6->v7:
  Fix code bugs that the test robot reported on mips and ia64.

  Add error code descriptions in reparent_resources() according to
  Andy's comments, and fix minor log typos.
v5->v6:
  Fix code style problems in reparent_resources() and use existing
  error codes, according to Andy's suggestions.

  Fix bugs the test robot reported.

v4->v5:
  Add new patch 0001 to move the duplicated reparent_resources() to
  kernel/resource.c so it can be shared by different architectures.

  Fix several code bugs reported by the test robot on powerpc and
  microblaze.
v3->v4:
  Fix several bugs the test robot reported. Rewrite the cover letter
  and patch logs according to reviewers' comments.

v2->v3:
  Rename the resource functions first_child() and sibling() to
  resource_first_child() and resource_sibling(). Dan suggested this.

  Move resource_first_child() and resource_sibling() to linux/ioport.h
  and make them inline functions. Rob suggested this. Accordingly,
  include linux/list.h in linux/ioport.h; please help review whether
  this brings any efficiency degradation or code redundancy.

  The change to struct resource {} increases its size by two pointers;
  mention this in the git log to be more specific. Rob suggested this.

v1->v2:
  Use list_head instead to link resource siblings. This is suggested
  by Andrew.

  Rewrite walk_system_ram_res_rev() after list_head is adopted to link
  resource siblings.

Baoquan He (4):
  resource: Move reparent_resources() to kernel/resource.c and make it
    public
  resource: Use list_head to link sibling resource
  resource: add walk_system_ram_res_rev()
  kexec_file: Load kernel at top of system RAM if required

 arch/arm/plat-samsung/pm-check.c            |   6 +-
 arch/ia64/sn/kernel/io_init.c               |   2 +-
 arch/microblaze/pci/pci-common.c            |  41 +----
 arch/mips/pci/pci-rc32434.c                 |  12 +-
 arch/powerpc/kernel/pci-common.c            |  39 +---
 arch/sparc/kernel/ioport.c                  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h        |   4 +-
 drivers/eisa/eisa-bus.c                     |   2 +
 drivers/gpu/drm/drm_memory.c                |   3 +-
 drivers/gpu/drm/gma500/gtt.c                |   5 +-
 drivers/hv/vmbus_drv.c                      |  52 +++---
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c             |   6 +-
 drivers/nvdimm/nd.h                         |   5 +-
 drivers/of/address.c                        |   4 +-
 drivers/parisc/lba_pci.c                    |   4 +-
 drivers/pci/controller/vmd.c                |   8 +-
 drivers/pci/probe.c                         |   2 +
 drivers/pci/setup-bus.c                     |   2 +-
 include/linux/ioport.h                      |  21 ++-
 kernel/kexec_file.c                         |   2 +
 kernel/resource.c                           | 266 ++++++++++++++++++----------
 22 files changed, 260 insertions(+), 232 deletions(-)

-- 
2.13.6


* [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
  2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
@ 2018-07-18  2:49 ` Baoquan He
  2018-07-18 16:36   ` Andy Shevchenko
  2018-07-18  2:49 ` [PATCH v7 2/4] resource: Use list_head to link sibling resource Baoquan He
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: brijesh.singh, devicetree, airlied, linux-pci, richard.weiyang,
	jcmvbkbc, Paul Mackerras, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, Baoquan He, linux-nvdimm,
	Michael Ellerman, patrik.r.jakobsson, linux-input, gustavo,
	dyoung, thomas.lendacky, haiyangz, maarten.lankhorst, jglisse,
	seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick, chris,
	monstr, linux-parisc, gregkh, dmitry.torokhov,
	Benjamin Herrenschmidt, ebiederm, devel, linuxppc-dev, davem

reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
so that it's shared.
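
For reference, a minimal sketch of how a caller uses the shared
function, loosely modeled on the arch PCI code (the warning message is
illustrative, not from this series):

	/*
	 * If claiming 'res' under 'parent' fails because existing
	 * children conflict, try to reparent those children under
	 * 'res' instead.
	 */
	if (request_resource(parent, res) < 0 &&
	    reparent_resources(parent, res) < 0)
		pr_warn("could not claim resource %pR under %s\n",
			res, parent->name);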

Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/microblaze/pci/pci-common.c | 37 -----------------------------------
 arch/powerpc/kernel/pci-common.c | 35 ---------------------------------
 include/linux/ioport.h           |  1 +
 kernel/resource.c                | 42 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index f34346d56095..7899bafab064 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
 EXPORT_SYMBOL(pcibios_add_device);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int __init reparent_resources(struct resource *parent,
-				     struct resource *res)
-{
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
-
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-		if (p->end < res->start)
-			continue;
-		if (res->end < p->start)
-			break;
-		if (p->start < res->start || p->end > res->end)
-			return -1;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
-	}
-	if (firstpp == NULL)
-		return -1;	/* didn't find any conflicting entries? */
-	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
-		p->parent = res;
-		pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
-			 p->name,
-			 (unsigned long long)p->start,
-			 (unsigned long long)p->end, res->name);
-	}
-	return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733ffffaa..926035bb378d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 EXPORT_SYMBOL(pcibios_align_resource);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int reparent_resources(struct resource *parent,
-				     struct resource *res)
-{
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
-
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-		if (p->end < res->start)
-			continue;
-		if (res->end < p->start)
-			break;
-		if (p->start < res->start || p->end > res->end)
-			return -1;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
-	}
-	if (firstpp == NULL)
-		return -1;	/* didn't find any conflicting entries? */
-	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
-		p->parent = res;
-		pr_debug("PCI: Reparented %s %pR under %s\n",
-			 p->name, p, res->name);
-	}
-	return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..dfdcd0bfe54e 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -192,6 +192,7 @@ extern int allocate_resource(struct resource *root, struct resource *new,
 struct resource *lookup_resource(struct resource *root, resource_size_t start);
 int adjust_resource(struct resource *res, resource_size_t start,
 		    resource_size_t size);
+int reparent_resources(struct resource *parent, struct resource *res);
 resource_size_t resource_alignment(struct resource *res);
 static inline resource_size_t resource_size(const struct resource *res)
 {
diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..81ccd19c1d9f 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -983,6 +983,48 @@ int adjust_resource(struct resource *res, resource_size_t start,
 }
 EXPORT_SYMBOL(adjust_resource);
 
+/**
+ * reparent_resources - reparent resource children of parent that res covers
+ * @parent: parent resource descriptor
+ * @res: resource descriptor desired by caller
+ *
+ * Returns 0 on success, -ENOTSUPP if child resource is not completely
+ * contained by 'res', -ECANCELED if no conflicting entry is found.
+ *
+ * Reparent resource children of 'parent' that conflict with 'res'
+ * under 'res', and make 'res' replace those children.
+ */
+int reparent_resources(struct resource *parent, struct resource *res)
+{
+	struct resource *p, **pp;
+	struct resource **firstpp = NULL;
+
+	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
+		if (p->end < res->start)
+			continue;
+		if (res->end < p->start)
+			break;
+		if (p->start < res->start || p->end > res->end)
+			return -ENOTSUPP;	/* not completely contained */
+		if (firstpp == NULL)
+			firstpp = pp;
+	}
+	if (firstpp == NULL)
+		return -ECANCELED; /* didn't find any conflicting entries? */
+	res->parent = parent;
+	res->child = *firstpp;
+	res->sibling = *pp;
+	*firstpp = res;
+	*pp = NULL;
+	for (p = res->child; p != NULL; p = p->sibling) {
+		p->parent = res;
+		pr_debug("PCI: Reparented %s %pR under %s\n",
+			 p->name, p, res->name);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(reparent_resources);
+
 static void __init __reserve_region_with_split(struct resource *root,
 		resource_size_t start, resource_size_t end,
 		const char *name)
-- 
2.13.6


* [PATCH v7 2/4] resource: Use list_head to link sibling resource
  2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
  2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
@ 2018-07-18  2:49 ` Baoquan He
  2018-07-18  2:49 ` [PATCH v7 3/4] resource: add walk_system_ram_res_rev() Baoquan He
  2018-07-18  2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He
  3 siblings, 0 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: linux-mips, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, Paul Mackerras, baiyaowei, kys,
	frowand.list, lorenzo.pieralisi, sthemmin, Baoquan He,
	linux-nvdimm, Michael Ellerman, patrik.r.jakobsson, linux-input,
	gustavo, dyoung, thomas.lendacky, haiyangz, maarten.lankhorst,
	jglisse, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov,
	Benjamin Herrenschmidt, ebiederm, devel, linuxppc-dev, davem

The struct resource uses a singly linked list, implemented by pointer
operations, to link siblings. Replace it with list_head for better code
readability.

Based on this list_head replacement, it will be very easy to do reverse
iteration on iomem_resource's sibling list in a later patch.

Besides, the types of the struct resource members sibling and child are
changed from 'struct resource *' to 'struct list_head', which increases
the size of struct resource by two pointers.
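
To illustrate the readability gain, a sketch (do_something() and
'parent' are placeholders, not code from this patch) of walking a
resource's children before and after the conversion:

	/* before: singly linked list, walked by pointer chasing */
	struct resource *p;

	for (p = parent->child; p; p = p->sibling)
		do_something(p);

	/* after: list_head, walked with the standard list iterator */
	struct resource *p;

	list_for_each_entry(p, &parent->child, sibling)
		do_something(p);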

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: David Airlie <airlied@linux.ie>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jonathan Derrick <jonathan.derrick@intel.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: devel@linuxdriverproject.org
Cc: linux-input@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: devicetree@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-mips@linux-mips.org
---
 arch/arm/plat-samsung/pm-check.c            |   6 +-
 arch/ia64/sn/kernel/io_init.c               |   2 +-
 arch/microblaze/pci/pci-common.c            |   4 +-
 arch/mips/pci/pci-rc32434.c                 |  12 +-
 arch/powerpc/kernel/pci-common.c            |   4 +-
 arch/sparc/kernel/ioport.c                  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h        |   4 +-
 drivers/eisa/eisa-bus.c                     |   2 +
 drivers/gpu/drm/drm_memory.c                |   3 +-
 drivers/gpu/drm/gma500/gtt.c                |   5 +-
 drivers/hv/vmbus_drv.c                      |  52 +++----
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c             |   6 +-
 drivers/nvdimm/nd.h                         |   5 +-
 drivers/of/address.c                        |   4 +-
 drivers/parisc/lba_pci.c                    |   4 +-
 drivers/pci/controller/vmd.c                |   8 +-
 drivers/pci/probe.c                         |   2 +
 drivers/pci/setup-bus.c                     |   2 +-
 include/linux/ioport.h                      |  17 ++-
 kernel/resource.c                           | 206 ++++++++++++++--------------
 21 files changed, 183 insertions(+), 171 deletions(-)

diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
 	while (ptr != NULL) {
-		if (ptr->child != NULL)
-			s3c_pm_run_res(ptr->child, fn, arg);
+		if (!list_empty(&ptr->child))
+			s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
 		if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
 				== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 			arg = (fn)(ptr, arg);
 		}
 
-		ptr = ptr->sibling;
+		ptr = resource_sibling(ptr);
 	}
 }
 
diff --git a/arch/ia64/sn/kernel/io_init.c b/arch/ia64/sn/kernel/io_init.c
index d63809a6adfa..338a7b7f194d 100644
--- a/arch/ia64/sn/kernel/io_init.c
+++ b/arch/ia64/sn/kernel/io_init.c
@@ -192,7 +192,7 @@ sn_io_slot_fixup(struct pci_dev *dev)
 		 * if it's already in the device structure, remove it before
 		 * inserting
 		 */
-		if (res->parent && res->parent->child)
+		if (res->parent && !list_empty(&res->parent->child))
 			release_resource(res);
 
 		if (res->flags & IORESOURCE_IO)
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 7899bafab064..2bf73e27e231 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -533,7 +533,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 			res->flags = range.flags;
 			res->start = range.cpu_addr;
 			res->end = range.cpu_addr + range.size - 1;
-			res->parent = res->child = res->sibling = NULL;
+			res->parent = NULL;
+			INIT_LIST_HEAD(&res->child);
+			INIT_LIST_HEAD(&res->sibling);
 		}
 	}
 
diff --git a/arch/mips/pci/pci-rc32434.c b/arch/mips/pci/pci-rc32434.c
index 7f6ce6d734c0..e80283df7925 100644
--- a/arch/mips/pci/pci-rc32434.c
+++ b/arch/mips/pci/pci-rc32434.c
@@ -53,8 +53,8 @@ static struct resource rc32434_res_pci_mem1 = {
 	.start = 0x50000000,
 	.end = 0x5FFFFFFF,
 	.flags = IORESOURCE_MEM,
-	.sibling = NULL,
-	.child = &rc32434_res_pci_mem2
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_mem1.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_mem1.child),
 };
 
 static struct resource rc32434_res_pci_mem2 = {
@@ -63,8 +63,8 @@ static struct resource rc32434_res_pci_mem2 = {
 	.end = 0x6FFFFFFF,
 	.flags = IORESOURCE_MEM,
 	.parent = &rc32434_res_pci_mem1,
-	.sibling = NULL,
-	.child = NULL
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_mem2.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_mem2.child),
 };
 
 static struct resource rc32434_res_pci_io1 = {
@@ -72,6 +72,8 @@ static struct resource rc32434_res_pci_io1 = {
 	.start = 0x18800000,
 	.end = 0x188FFFFF,
 	.flags = IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_io1.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_io1.child),
 };
 
 extern struct pci_ops rc32434_pci_ops;
@@ -208,6 +210,8 @@ static int __init rc32434_pci_init(void)
 
 	pr_info("PCI: Initializing PCI\n");
 
+	list_add(&rc32434_res_pci_mem2.sibling, &rc32434_res_pci_mem1.child);
+
 	ioport_resource.start = rc32434_res_pci_io1.start;
 	ioport_resource.end = rc32434_res_pci_io1.end;
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 926035bb378d..28fbe83c9daf 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -761,7 +761,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 			res->flags = range.flags;
 			res->start = range.cpu_addr;
 			res->end = range.cpu_addr + range.size - 1;
-			res->parent = res->child = res->sibling = NULL;
+			res->parent = NULL;
+			INIT_LIST_HEAD(&res->child);
+			INIT_LIST_HEAD(&res->sibling);
 		}
 	}
 }
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index cca9134cfa7d..99efe4e98b16 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -669,7 +669,7 @@ static int sparc_io_proc_show(struct seq_file *m, void *v)
 	struct resource *root = m->private, *r;
 	const char *nm;
 
-	for (r = root->child; r != NULL; r = r->sibling) {
+	list_for_each_entry(r, &root->child, sibling) {
 		if ((nm = r->name) == NULL) nm = "???";
 		seq_printf(m, "%016llx-%016llx: %s\n",
 				(unsigned long long)r->start,
diff --git a/arch/xtensa/include/asm/pci-bridge.h b/arch/xtensa/include/asm/pci-bridge.h
index 0b68c76ec1e6..f487b06817df 100644
--- a/arch/xtensa/include/asm/pci-bridge.h
+++ b/arch/xtensa/include/asm/pci-bridge.h
@@ -71,8 +71,8 @@ static inline void pcibios_init_resource(struct resource *res,
 	res->flags = flags;
 	res->name = name;
 	res->parent = NULL;
-	res->sibling = NULL;
-	res->child = NULL;
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 }
 
 
diff --git a/drivers/eisa/eisa-bus.c b/drivers/eisa/eisa-bus.c
index 1e8062f6dbfc..dba78f75fd06 100644
--- a/drivers/eisa/eisa-bus.c
+++ b/drivers/eisa/eisa-bus.c
@@ -408,6 +408,8 @@ static struct resource eisa_root_res = {
 	.start = 0,
 	.end   = 0xffffffff,
 	.flags = IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(eisa_root_res.sibling),
+	.child  = LIST_HEAD_INIT(eisa_root_res.child),
 };
 
 static int eisa_bus_count;
diff --git a/drivers/gpu/drm/drm_memory.c b/drivers/gpu/drm/drm_memory.c
index d69e4fc1ee77..33baa7fa5e41 100644
--- a/drivers/gpu/drm/drm_memory.c
+++ b/drivers/gpu/drm/drm_memory.c
@@ -155,9 +155,8 @@ u64 drm_get_max_iomem(void)
 	struct resource *tmp;
 	resource_size_t max_iomem = 0;
 
-	for (tmp = iomem_resource.child; tmp; tmp = tmp->sibling) {
+	list_for_each_entry(tmp, &iomem_resource.child, sibling)
 		max_iomem = max(max_iomem,  tmp->end);
-	}
 
 	return max_iomem;
 }
diff --git a/drivers/gpu/drm/gma500/gtt.c b/drivers/gpu/drm/gma500/gtt.c
index 3949b0990916..addd3bc009af 100644
--- a/drivers/gpu/drm/gma500/gtt.c
+++ b/drivers/gpu/drm/gma500/gtt.c
@@ -565,7 +565,7 @@ int psb_gtt_init(struct drm_device *dev, int resume)
 int psb_gtt_restore(struct drm_device *dev)
 {
 	struct drm_psb_private *dev_priv = dev->dev_private;
-	struct resource *r = dev_priv->gtt_mem->child;
+	struct resource *r;
 	struct gtt_range *range;
 	unsigned int restored = 0, total = 0, size = 0;
 
@@ -573,14 +573,13 @@ int psb_gtt_restore(struct drm_device *dev)
 	mutex_lock(&dev_priv->gtt_mutex);
 	psb_gtt_init(dev, 1);
 
-	while (r != NULL) {
+	list_for_each_entry(r, &dev_priv->gtt_mem->child, sibling) {
 		range = container_of(r, struct gtt_range, resource);
 		if (range->pages) {
 			psb_gtt_insert(dev, range, 1);
 			size += range->resource.end - range->resource.start;
 			restored++;
 		}
-		r = r->sibling;
 		total++;
 	}
 	mutex_unlock(&dev_priv->gtt_mutex);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b10fe26c4891..d87ec5a1bc4c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1412,9 +1412,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
 {
 	resource_size_t start = 0;
 	resource_size_t end = 0;
-	struct resource *new_res;
+	struct resource *new_res, *tmp;
 	struct resource **old_res = &hyperv_mmio;
-	struct resource **prev_res = NULL;
 
 	switch (res->type) {
 
@@ -1461,44 +1460,36 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
 	/*
 	 * If two ranges are adjacent, merge them.
 	 */
-	do {
-		if (!*old_res) {
-			*old_res = new_res;
-			break;
-		}
-
-		if (((*old_res)->end + 1) == new_res->start) {
-			(*old_res)->end = new_res->end;
+	if (!*old_res) {
+		*old_res = new_res;
+		return AE_OK;
+	}
+	tmp = *old_res;
+	list_for_each_entry_from(tmp, &tmp->parent->child, sibling) {
+		if ((tmp->end + 1) == new_res->start) {
+			tmp->end = new_res->end;
 			kfree(new_res);
 			break;
 		}
 
-		if ((*old_res)->start == new_res->end + 1) {
-			(*old_res)->start = new_res->start;
+		if (tmp->start == new_res->end + 1) {
+			tmp->start = new_res->start;
 			kfree(new_res);
 			break;
 		}
 
-		if ((*old_res)->start > new_res->end) {
-			new_res->sibling = *old_res;
-			if (prev_res)
-				(*prev_res)->sibling = new_res;
-			*old_res = new_res;
+		if (tmp->start > new_res->end) {
+			list_add(&new_res->sibling, tmp->sibling.prev);
 			break;
 		}
-
-		prev_res = old_res;
-		old_res = &(*old_res)->sibling;
-
-	} while (1);
+	}
 
 	return AE_OK;
 }
 
 static int vmbus_acpi_remove(struct acpi_device *device)
 {
-	struct resource *cur_res;
-	struct resource *next_res;
+	struct resource *res;
 
 	if (hyperv_mmio) {
 		if (fb_mmio) {
@@ -1507,10 +1498,9 @@ static int vmbus_acpi_remove(struct acpi_device *device)
 			fb_mmio = NULL;
 		}
 
-		for (cur_res = hyperv_mmio; cur_res; cur_res = next_res) {
-			next_res = cur_res->sibling;
-			kfree(cur_res);
-		}
+		res = hyperv_mmio;
+		list_for_each_entry_from(res, &res->parent->child, sibling)
+			kfree(res);
 	}
 
 	return 0;
@@ -1596,7 +1586,8 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
 		}
 	}
 
-	for (iter = hyperv_mmio; iter; iter = iter->sibling) {
+	iter = hyperv_mmio;
+	list_for_each_entry_from(iter, &iter->parent->child, sibling) {
 		if ((iter->start >= max) || (iter->end <= min))
 			continue;
 
@@ -1639,7 +1630,8 @@ void vmbus_free_mmio(resource_size_t start, resource_size_t size)
 	struct resource *iter;
 
 	down(&hyperv_mmio_lock);
-	for (iter = hyperv_mmio; iter; iter = iter->sibling) {
+	iter = hyperv_mmio;
+	list_for_each_entry_from(iter, &iter->parent->child, sibling) {
 		if ((iter->start >= start + size) || (iter->end <= start))
 			continue;
 
diff --git a/drivers/input/joystick/iforce/iforce-main.c b/drivers/input/joystick/iforce/iforce-main.c
index daeeb4c7e3b0..5c0be27b33ff 100644
--- a/drivers/input/joystick/iforce/iforce-main.c
+++ b/drivers/input/joystick/iforce/iforce-main.c
@@ -305,8 +305,8 @@ int iforce_init_device(struct iforce *iforce)
 	iforce->device_memory.end = 200;
 	iforce->device_memory.flags = IORESOURCE_MEM;
 	iforce->device_memory.parent = NULL;
-	iforce->device_memory.child = NULL;
-	iforce->device_memory.sibling = NULL;
+	INIT_LIST_HEAD(&iforce->device_memory.child);
+	INIT_LIST_HEAD(&iforce->device_memory.sibling);
 
 /*
  * Wait until device ready - until it sends its first response.
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 28afdd668905..f53d410d9981 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -637,7 +637,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
  retry:
 	first = 0;
 	for_each_dpa_resource(ndd, res) {
-		struct resource *next = res->sibling, *new_res = NULL;
+		struct resource *next = resource_sibling(res), *new_res = NULL;
 		resource_size_t allocate, available = 0;
 		enum alloc_loc loc = ALLOC_ERR;
 		const char *action;
@@ -763,7 +763,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 	 * an initial "pmem-reserve pass".  Only do an initial BLK allocation
 	 * when none of the DPA space is reserved.
 	 */
-	if ((is_pmem || !ndd->dpa.child) && n == to_allocate)
+	if ((is_pmem || list_empty(&ndd->dpa.child)) && n == to_allocate)
 		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
 	return n;
 }
@@ -779,7 +779,7 @@ static int merge_dpa(struct nd_region *nd_region,
  retry:
 	for_each_dpa_resource(ndd, res) {
 		int rc;
-		struct resource *next = res->sibling;
+		struct resource *next = resource_sibling(res);
 		resource_size_t end = res->start + resource_size(res);
 
 		if (!next || strcmp(res->name, label_id->id) != 0
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 32e0364b48b9..da7da15e03e7 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -102,11 +102,10 @@ unsigned sizeof_namespace_label(struct nvdimm_drvdata *ndd);
 		(unsigned long long) (res ? res->start : 0), ##arg)
 
 #define for_each_dpa_resource(ndd, res) \
-	for (res = (ndd)->dpa.child; res; res = res->sibling)
+	list_for_each_entry(res, &(ndd)->dpa.child, sibling)
 
 #define for_each_dpa_resource_safe(ndd, res, next) \
-	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
-			res; res = next, next = next ? next->sibling : NULL)
+	list_for_each_entry_safe(res, next, &(ndd)->dpa.child, sibling)
 
 struct nd_percpu_lane {
 	int count;
diff --git a/drivers/of/address.c b/drivers/of/address.c
index 53349912ac75..e2e25719ab52 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -330,7 +330,9 @@ int of_pci_range_to_resource(struct of_pci_range *range,
 {
 	int err;
 	res->flags = range->flags;
-	res->parent = res->child = res->sibling = NULL;
+	res->parent = NULL;
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 	res->name = np->full_name;
 
 	if (res->flags & IORESOURCE_IO) {
diff --git a/drivers/parisc/lba_pci.c b/drivers/parisc/lba_pci.c
index 69bd98421eb1..7482bdfd1959 100644
--- a/drivers/parisc/lba_pci.c
+++ b/drivers/parisc/lba_pci.c
@@ -170,8 +170,8 @@ lba_dump_res(struct resource *r, int d)
 	for (i = d; i ; --i) printk(" ");
 	printk(KERN_DEBUG "%p [%lx,%lx]/%lx\n", r,
 		(long)r->start, (long)r->end, r->flags);
-	lba_dump_res(r->child, d+2);
-	lba_dump_res(r->sibling, d);
+	lba_dump_res(resource_first_child(&r->child), d+2);
+	lba_dump_res(resource_sibling(r), d);
 }
 
 
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 942b64fc7f1f..e3ace20345c7 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -542,14 +542,14 @@ static struct pci_ops vmd_ops = {
 
 static void vmd_attach_resources(struct vmd_dev *vmd)
 {
-	vmd->dev->resource[VMD_MEMBAR1].child = &vmd->resources[1];
-	vmd->dev->resource[VMD_MEMBAR2].child = &vmd->resources[2];
+	list_add(&vmd->resources[1].sibling, &vmd->dev->resource[VMD_MEMBAR1].child);
+	list_add(&vmd->resources[2].sibling, &vmd->dev->resource[VMD_MEMBAR2].child);
 }
 
 static void vmd_detach_resources(struct vmd_dev *vmd)
 {
-	vmd->dev->resource[VMD_MEMBAR1].child = NULL;
-	vmd->dev->resource[VMD_MEMBAR2].child = NULL;
+	INIT_LIST_HEAD(&vmd->dev->resource[VMD_MEMBAR1].child);
+	INIT_LIST_HEAD(&vmd->dev->resource[VMD_MEMBAR2].child);
 }
 
 /*
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..9624dd1dfd49 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -59,6 +59,8 @@ static struct resource *get_pci_domain_busn_res(int domain_nr)
 	r->res.start = 0;
 	r->res.end = 0xff;
 	r->res.flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED;
+	INIT_LIST_HEAD(&r->res.child);
+	INIT_LIST_HEAD(&r->res.sibling);
 
 	list_add_tail(&r->list, &pci_domain_busn_res_list);
 
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 79b1824e83b4..8e685af8938d 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -2107,7 +2107,7 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type)
 				continue;
 
 			/* Ignore BARs which are still in use */
-			if (res->child)
+			if (!list_empty(&res->child))
 				continue;
 
 			ret = add_to_list(&saved, bridge, res, 0, 0);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index dfdcd0bfe54e..b7456ae889dd 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -12,6 +12,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/list.h>
 /*
  * Resources are tree-like, allowing
  * nesting etc..
@@ -22,7 +23,8 @@ struct resource {
 	const char *name;
 	unsigned long flags;
 	unsigned long desc;
-	struct resource *parent, *sibling, *child;
+	struct list_head child, sibling;
+	struct resource *parent;
 };
 
 /*
@@ -216,7 +218,6 @@ static inline bool resource_contains(struct resource *r1, struct resource *r2)
 	return r1->start <= r2->start && r1->end >= r2->end;
 }
 
-
 /* Convenience shorthand with allocation */
 #define request_region(start,n,name)		__request_region(&ioport_resource, (start), (n), (name), 0)
 #define request_muxed_region(start,n,name)	__request_region(&ioport_resource, (start), (n), (name), IORESOURCE_MUXED)
@@ -287,6 +288,18 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
        return (r1->start <= r2->end && r1->end >= r2->start);
 }
 
+static inline struct resource *resource_sibling(struct resource *res)
+{
+	if (res->parent && !list_is_last(&res->sibling, &res->parent->child))
+		return list_next_entry(res, sibling);
+	return NULL;
+}
+
+static inline struct resource *resource_first_child(struct list_head *head)
+{
+	return list_first_entry_or_null(head, struct resource, sibling);
+}
+
 
 #endif /* __ASSEMBLY__ */
 #endif	/* _LINUX_IOPORT_H */
diff --git a/kernel/resource.c b/kernel/resource.c
index 81ccd19c1d9f..c96e58d3d2f8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -31,6 +31,8 @@ struct resource ioport_resource = {
 	.start	= 0,
 	.end	= IO_SPACE_LIMIT,
 	.flags	= IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(ioport_resource.sibling),
+	.child  = LIST_HEAD_INIT(ioport_resource.child),
 };
 EXPORT_SYMBOL(ioport_resource);
 
@@ -39,6 +41,8 @@ struct resource iomem_resource = {
 	.start	= 0,
 	.end	= -1,
 	.flags	= IORESOURCE_MEM,
+	.sibling = LIST_HEAD_INIT(iomem_resource.sibling),
+	.child  = LIST_HEAD_INIT(iomem_resource.child),
 };
 EXPORT_SYMBOL(iomem_resource);
 
@@ -57,20 +61,20 @@ static DEFINE_RWLOCK(resource_lock);
  * by boot mem after the system is up. So for reusing the resource entry
  * we need to remember the resource.
  */
-static struct resource *bootmem_resource_free;
+static struct list_head bootmem_resource_free = LIST_HEAD_INIT(bootmem_resource_free);
 static DEFINE_SPINLOCK(bootmem_resource_lock);
 
 static struct resource *next_resource(struct resource *p, bool sibling_only)
 {
 	/* Caller wants to traverse through siblings only */
 	if (sibling_only)
-		return p->sibling;
+		return resource_sibling(p);
 
-	if (p->child)
-		return p->child;
-	while (!p->sibling && p->parent)
+	if (!list_empty(&p->child))
+		return resource_first_child(&p->child);
+	while (!resource_sibling(p) && p->parent)
 		p = p->parent;
-	return p->sibling;
+	return resource_sibling(p);
 }
 
 static void *r_next(struct seq_file *m, void *v, loff_t *pos)
@@ -90,7 +94,7 @@ static void *r_start(struct seq_file *m, loff_t *pos)
 	struct resource *p = PDE_DATA(file_inode(m->file));
 	loff_t l = 0;
 	read_lock(&resource_lock);
-	for (p = p->child; p && l < *pos; p = r_next(m, p, &l))
+	for (p = resource_first_child(&p->child); p && l < *pos; p = r_next(m, p, &l))
 		;
 	return p;
 }
@@ -153,8 +157,7 @@ static void free_resource(struct resource *res)
 
 	if (!PageSlab(virt_to_head_page(res))) {
 		spin_lock(&bootmem_resource_lock);
-		res->sibling = bootmem_resource_free;
-		bootmem_resource_free = res;
+		list_add(&res->sibling, &bootmem_resource_free);
 		spin_unlock(&bootmem_resource_lock);
 	} else {
 		kfree(res);
@@ -166,10 +169,9 @@ static struct resource *alloc_resource(gfp_t flags)
 	struct resource *res = NULL;
 
 	spin_lock(&bootmem_resource_lock);
-	if (bootmem_resource_free) {
-		res = bootmem_resource_free;
-		bootmem_resource_free = res->sibling;
-	}
+	res = resource_first_child(&bootmem_resource_free);
+	if (res)
+		list_del(&res->sibling);
 	spin_unlock(&bootmem_resource_lock);
 
 	if (res)
@@ -177,6 +179,8 @@ static struct resource *alloc_resource(gfp_t flags)
 	else
 		res = kzalloc(sizeof(struct resource), flags);
 
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 	return res;
 }
 
@@ -185,7 +189,7 @@ static struct resource * __request_resource(struct resource *root, struct resour
 {
 	resource_size_t start = new->start;
 	resource_size_t end = new->end;
-	struct resource *tmp, **p;
+	struct resource *tmp;
 
 	if (end < start)
 		return root;
@@ -193,64 +197,62 @@ static struct resource * __request_resource(struct resource *root, struct resour
 		return root;
 	if (end > root->end)
 		return root;
-	p = &root->child;
-	for (;;) {
-		tmp = *p;
-		if (!tmp || tmp->start > end) {
-			new->sibling = tmp;
-			*p = new;
+
+	if (list_empty(&root->child)) {
+		list_add(&new->sibling, &root->child);
+		new->parent = root;
+		INIT_LIST_HEAD(&new->child);
+		return NULL;
+	}
+
+	list_for_each_entry(tmp, &root->child, sibling) {
+		if (tmp->start > end) {
+			list_add(&new->sibling, tmp->sibling.prev);
 			new->parent = root;
+			INIT_LIST_HEAD(&new->child);
 			return NULL;
 		}
-		p = &tmp->sibling;
 		if (tmp->end < start)
 			continue;
 		return tmp;
 	}
+
+	list_add_tail(&new->sibling, &root->child);
+	new->parent = root;
+	INIT_LIST_HEAD(&new->child);
+	return NULL;
 }
 
 static int __release_resource(struct resource *old, bool release_child)
 {
-	struct resource *tmp, **p, *chd;
+	struct resource *tmp, *next, *chd;
 
-	p = &old->parent->child;
-	for (;;) {
-		tmp = *p;
-		if (!tmp)
-			break;
+	list_for_each_entry_safe(tmp, next, &old->parent->child, sibling) {
 		if (tmp == old) {
-			if (release_child || !(tmp->child)) {
-				*p = tmp->sibling;
+			if (release_child || list_empty(&tmp->child)) {
+				list_del(&tmp->sibling);
 			} else {
-				for (chd = tmp->child;; chd = chd->sibling) {
+				list_for_each_entry(chd, &tmp->child, sibling)
 					chd->parent = tmp->parent;
-					if (!(chd->sibling))
-						break;
-				}
-				*p = tmp->child;
-				chd->sibling = tmp->sibling;
+				list_splice(&tmp->child, tmp->sibling.prev);
+				list_del(&tmp->sibling);
 			}
+
 			old->parent = NULL;
 			return 0;
 		}
-		p = &tmp->sibling;
 	}
 	return -EINVAL;
 }
 
 static void __release_child_resources(struct resource *r)
 {
-	struct resource *tmp, *p;
+	struct resource *tmp, *next;
 	resource_size_t size;
 
-	p = r->child;
-	r->child = NULL;
-	while (p) {
-		tmp = p;
-		p = p->sibling;
-
+	list_for_each_entry_safe(tmp, next, &r->child, sibling) {
 		tmp->parent = NULL;
-		tmp->sibling = NULL;
+		list_del_init(&tmp->sibling);
 		__release_child_resources(tmp);
 
 		printk(KERN_DEBUG "release child resource %pR\n", tmp);
@@ -259,6 +261,8 @@ static void __release_child_resources(struct resource *r)
 		tmp->start = 0;
 		tmp->end = size - 1;
 	}
+
+	INIT_LIST_HEAD(&tmp->child);
 }
 
 void release_child_resources(struct resource *r)
@@ -343,7 +347,8 @@ static int find_next_iomem_res(struct resource *res, unsigned long desc,
 
 	read_lock(&resource_lock);
 
-	for (p = iomem_resource.child; p; p = next_resource(p, sibling_only)) {
+	for (p = resource_first_child(&iomem_resource.child); p;
+			p = next_resource(p, sibling_only)) {
 		if ((p->flags & res->flags) != res->flags)
 			continue;
 		if ((desc != IORES_DESC_NONE) && (desc != p->desc))
@@ -532,7 +537,7 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags,
 	struct resource *p;
 
 	read_lock(&resource_lock);
-	for (p = iomem_resource.child; p ; p = p->sibling) {
+	list_for_each_entry(p, &iomem_resource.child, sibling) {
 		bool is_type = (((p->flags & flags) == flags) &&
 				((desc == IORES_DESC_NONE) ||
 				 (desc == p->desc)));
@@ -586,7 +591,7 @@ static int __find_resource(struct resource *root, struct resource *old,
 			 resource_size_t  size,
 			 struct resource_constraint *constraint)
 {
-	struct resource *this = root->child;
+	struct resource *this = resource_first_child(&root->child);
 	struct resource tmp = *new, avail, alloc;
 
 	tmp.start = root->start;
@@ -596,7 +601,7 @@ static int __find_resource(struct resource *root, struct resource *old,
 	 */
 	if (this && this->start == root->start) {
 		tmp.start = (this == old) ? old->start : this->end + 1;
-		this = this->sibling;
+		this = resource_sibling(this);
 	}
 	for(;;) {
 		if (this)
@@ -632,7 +637,7 @@ next:		if (!this || this->end == root->end)
 
 		if (this != old)
 			tmp.start = this->end + 1;
-		this = this->sibling;
+		this = resource_sibling(this);
 	}
 	return -EBUSY;
 }
@@ -676,7 +681,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
 		goto out;
 	}
 
-	if (old->child) {
+	if (!list_empty(&old->child)) {
 		err = -EBUSY;
 		goto out;
 	}
@@ -757,7 +762,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)
 	struct resource *res;
 
 	read_lock(&resource_lock);
-	for (res = root->child; res; res = res->sibling) {
+	list_for_each_entry(res, &root->child, sibling) {
 		if (res->start == start)
 			break;
 	}
@@ -790,32 +795,27 @@ static struct resource * __insert_resource(struct resource *parent, struct resou
 			break;
 	}
 
-	for (next = first; ; next = next->sibling) {
+	for (next = first; ; next = resource_sibling(next)) {
 		/* Partial overlap? Bad, and unfixable */
 		if (next->start < new->start || next->end > new->end)
 			return next;
-		if (!next->sibling)
+		if (!resource_sibling(next))
 			break;
-		if (next->sibling->start > new->end)
+		if (resource_sibling(next)->start > new->end)
 			break;
 	}
-
 	new->parent = parent;
-	new->sibling = next->sibling;
-	new->child = first;
+	list_add(&new->sibling, &next->sibling);
+	INIT_LIST_HEAD(&new->child);
 
-	next->sibling = NULL;
-	for (next = first; next; next = next->sibling)
+	/*
+	 * Resources from first to next all fall into new's region, so
+	 * make them new's children.
+	 */
+	list_cut_position(&new->child, first->sibling.prev, &next->sibling);
+	list_for_each_entry(next, &new->child, sibling)
 		next->parent = new;
 
-	if (parent->child == first) {
-		parent->child = new;
-	} else {
-		next = parent->child;
-		while (next->sibling != first)
-			next = next->sibling;
-		next->sibling = new;
-	}
 	return NULL;
 }
 
@@ -937,19 +937,17 @@ static int __adjust_resource(struct resource *res, resource_size_t start,
 	if ((start < parent->start) || (end > parent->end))
 		goto out;
 
-	if (res->sibling && (res->sibling->start <= end))
+	if (resource_sibling(res) && (resource_sibling(res)->start <= end))
 		goto out;
 
-	tmp = parent->child;
-	if (tmp != res) {
-		while (tmp->sibling != res)
-			tmp = tmp->sibling;
+	if (res->sibling.prev != &parent->child) {
+		tmp = list_prev_entry(res, sibling);
 		if (start <= tmp->end)
 			goto out;
 	}
 
 skip:
-	for (tmp = res->child; tmp; tmp = tmp->sibling)
+	list_for_each_entry(tmp, &res->child, sibling)
 		if ((tmp->start < start) || (tmp->end > end))
 			goto out;
 
@@ -996,27 +994,30 @@ EXPORT_SYMBOL(adjust_resource);
  */
 int reparent_resources(struct resource *parent, struct resource *res)
 {
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
+	struct resource *p, *first = NULL;
 
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
+	list_for_each_entry(p, &parent->child, sibling) {
 		if (p->end < res->start)
 			continue;
 		if (res->end < p->start)
 			break;
 		if (p->start < res->start || p->end > res->end)
 			return -ENOTSUPP;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
+		if (first == NULL)
+			first = p;
 	}
-	if (firstpp == NULL)
+	if (first == NULL)
 		return -ECANCELED; /* didn't find any conflicting entries? */
 	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
+	list_add(&res->sibling, p->sibling.prev);
+	INIT_LIST_HEAD(&res->child);
+
+	/*
+	 * Resources from first to p's previous sibling all fall into
+	 * res's region, so make them res's children.
+	 */
+	list_cut_position(&res->child, first->sibling.prev, res->sibling.prev);
+	list_for_each_entry(p, &res->child, sibling) {
 		p->parent = res;
 		pr_debug("PCI: Reparented %s %pR under %s\n",
 			 p->name, p, res->name);
@@ -1216,34 +1217,32 @@ EXPORT_SYMBOL(__request_region);
 void __release_region(struct resource *parent, resource_size_t start,
 			resource_size_t n)
 {
-	struct resource **p;
+	struct resource *res;
 	resource_size_t end;
 
-	p = &parent->child;
+	res = resource_first_child(&parent->child);
 	end = start + n - 1;
 
 	write_lock(&resource_lock);
 
 	for (;;) {
-		struct resource *res = *p;
-
 		if (!res)
 			break;
 		if (res->start <= start && res->end >= end) {
 			if (!(res->flags & IORESOURCE_BUSY)) {
-				p = &res->child;
+				res = resource_first_child(&res->child);
 				continue;
 			}
 			if (res->start != start || res->end != end)
 				break;
-			*p = res->sibling;
+			list_del(&res->sibling);
 			write_unlock(&resource_lock);
 			if (res->flags & IORESOURCE_MUXED)
 				wake_up(&muxed_resource_wait);
 			free_resource(res);
 			return;
 		}
-		p = &res->sibling;
+		res = resource_sibling(res);
 	}
 
 	write_unlock(&resource_lock);
@@ -1278,9 +1277,7 @@ EXPORT_SYMBOL(__release_region);
 int release_mem_region_adjustable(struct resource *parent,
 			resource_size_t start, resource_size_t size)
 {
-	struct resource **p;
-	struct resource *res;
-	struct resource *new_res;
+	struct resource *res, *new_res;
 	resource_size_t end;
 	int ret = -EINVAL;
 
@@ -1291,16 +1288,16 @@ int release_mem_region_adjustable(struct resource *parent,
 	/* The alloc_resource() result gets checked later */
 	new_res = alloc_resource(GFP_KERNEL);
 
-	p = &parent->child;
+	res = resource_first_child(&parent->child);
 	write_lock(&resource_lock);
 
-	while ((res = *p)) {
+	while (res) {
 		if (res->start >= end)
 			break;
 
 		/* look for the next resource if it does not fit into */
 		if (res->start > start || res->end < end) {
-			p = &res->sibling;
+			res = resource_sibling(res);
 			continue;
 		}
 
@@ -1308,14 +1305,14 @@ int release_mem_region_adjustable(struct resource *parent,
 			break;
 
 		if (!(res->flags & IORESOURCE_BUSY)) {
-			p = &res->child;
+			res = resource_first_child(&res->child);
 			continue;
 		}
 
 		/* found the target resource; let's adjust accordingly */
 		if (res->start == start && res->end == end) {
 			/* free the whole entry */
-			*p = res->sibling;
+			list_del(&res->sibling);
 			free_resource(res);
 			ret = 0;
 		} else if (res->start == start && res->end != end) {
@@ -1338,14 +1335,13 @@ int release_mem_region_adjustable(struct resource *parent,
 			new_res->flags = res->flags;
 			new_res->desc = res->desc;
 			new_res->parent = res->parent;
-			new_res->sibling = res->sibling;
-			new_res->child = NULL;
+			INIT_LIST_HEAD(&new_res->child);
 
 			ret = __adjust_resource(res, res->start,
 						start - res->start);
 			if (ret)
 				break;
-			res->sibling = new_res;
+			list_add(&new_res->sibling, &res->sibling);
 			new_res = NULL;
 		}
 
@@ -1526,7 +1522,7 @@ static int __init reserve_setup(char *str)
 			res->end = io_start + io_num - 1;
 			res->flags |= IORESOURCE_BUSY;
 			res->desc = IORES_DESC_NONE;
-			res->child = NULL;
+			INIT_LIST_HEAD(&res->child);
 			if (request_resource(parent, res) == 0)
 				reserved = x+1;
 		}
@@ -1546,7 +1542,7 @@ int iomem_map_sanity_check(resource_size_t addr, unsigned long size)
 	loff_t l;
 
 	read_lock(&resource_lock);
-	for (p = p->child; p ; p = r_next(NULL, p, &l)) {
+	for (p = resource_first_child(&p->child); p; p = r_next(NULL, p, &l)) {
 		/*
 		 * We can probably skip the resources without
 		 * IORESOURCE_IO attribute?
@@ -1602,7 +1598,7 @@ bool iomem_is_exclusive(u64 addr)
 	addr = addr & PAGE_MASK;
 
 	read_lock(&resource_lock);
-	for (p = p->child; p ; p = r_next(NULL, p, &l)) {
+	for (p = resource_first_child(&p->child); p; p = r_next(NULL, p, &l)) {
 		/*
 		 * We can probably skip the resources without
 		 * IORESOURCE_IO attribute?
-- 
2.13.6


* [PATCH v7 3/4] resource: add walk_system_ram_res_rev()
  2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
  2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
  2018-07-18  2:49 ` [PATCH v7 2/4] resource: Use list_head to link sibling resource Baoquan He
@ 2018-07-18  2:49 ` Baoquan He
  2018-07-18  2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He
  3 siblings, 0 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: brijesh.singh, devicetree, airlied, linux-pci, richard.weiyang,
	jcmvbkbc, baiyaowei, kys, frowand.list, lorenzo.pieralisi,
	sthemmin, Baoquan He, linux-nvdimm, patrik.r.jakobsson,
	linux-input, gustavo, dyoung, thomas.lendacky, haiyangz,
	maarten.lankhorst, jglisse, seanpaul, bhelgaas, tglx, yinghai,
	jonathan.derrick, chris, monstr, linux-parisc, gregkh,
	dmitry.torokhov, ebiederm, devel, linuxppc-dev, davem

This function, a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through all System RAM resources in reverse order,
i.e., from higher to lower addresses.

It will be used in kexec_file code.
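
A hedged usage sketch (the callback and its bookkeeping are
illustrative, not part of this series): the function invokes the
callback once per matching System RAM range, from the highest range
down, and stops on the first non-zero return:

	/* illustrative callback: remember the highest RAM range seen */
	static int find_topmost_ram(struct resource *res, void *arg)
	{
		*(struct resource *)arg = *res;
		return 1;	/* non-zero return value stops the walk */
	}

	struct resource top;

	if (walk_system_ram_res_rev(0, ULONG_MAX, &top,
				    find_topmost_ram) == 1)
		pr_info("topmost System RAM range: %pR\n", &top);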

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c      | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b7456ae889dd..066cc263e2cc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -279,6 +279,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
 		    int (*func)(struct resource *, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+			int (*func)(struct resource *, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
 		    void *arg, int (*func)(struct resource *, void *));
 
diff --git a/kernel/resource.c b/kernel/resource.c
index c96e58d3d2f8..3e18f24b90c4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include <linux/pfn.h>
 #include <linux/mm.h>
 #include <linux/resource_ext.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
 #include <asm/io.h>
 
 
@@ -443,6 +445,44 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 }
 
 /*
+ * This function, being a variant of walk_system_ram_res(), calls the @func
+ * callback against all memory ranges of type System RAM which are marked as
+ * IORESOURCE_SYSTEM_RAM and IORESOURCE_BUSY in reverse order, i.e., from
+ * higher to lower.
+ */
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+				int (*func)(struct resource *, void *))
+{
+	unsigned long flags;
+	struct resource *res;
+	int ret = -1;
+
+	flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+	read_lock(&resource_lock);
+	list_for_each_entry_reverse(res, &iomem_resource.child, sibling) {
+		if (start >= end)
+			break;
+		if ((res->flags & flags) != flags)
+			continue;
+		if (res->desc != IORES_DESC_NONE)
+			continue;
+		if (res->end < start)
+			break;
+
+		if ((res->end >= start) && (res->start < end)) {
+			ret = (*func)(res, arg);
+			if (ret)
+				break;
+		}
+		end = res->start - 1;
+	}
+	read_unlock(&resource_lock);
+	return ret;
+}
+
+/*
  * This function calls the @func callback against all memory ranges, which
  * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  */
-- 
2.13.6


* [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
                   ` (2 preceding siblings ...)
  2018-07-18  2:49 ` [PATCH v7 3/4] resource: add walk_system_ram_res_rev() Baoquan He
@ 2018-07-18  2:49 ` Baoquan He
  2018-07-18 22:33   ` Andrew Morton
  3 siblings, 1 reply; 21+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: brijesh.singh, devicetree, airlied, linux-pci, richard.weiyang,
	jcmvbkbc, baiyaowei, kys, frowand.list, lorenzo.pieralisi,
	sthemmin, Baoquan He, linux-nvdimm, patrik.r.jakobsson,
	linux-input, gustavo, dyoung, thomas.lendacky, haiyangz,
	maarten.lankhorst, jglisse, seanpaul, bhelgaas, tglx, yinghai,
	jonathan.derrick, chris, monstr, linux-parisc, gregkh,
	dmitry.torokhov, kexec, ebiederm, devel, linuxppc-dev, davem

For kexec_file loading, if kexec_buf.top_down is 'true', the memory used
to load the kernel/initrd/purgatory is supposed to be allocated from top
to bottom. This is what the old kexec loading interface has always done,
and that interface is still the default in some distributions. However,
the current kexec_file loading interface doesn't behave this way: the
arch_kexec_walk_mem() function it calls ignores kexec_buf.top_down and
instead calls walk_system_ram_res() directly to go through all System
RAM resources from bottom to top, trying to find a memory region which
can contain the kexec buffer, and then calls locate_mem_hole_callback()
to allocate memory within that region from top to bottom. This causes
confusion, especially now that KASLR is widely supported: users have to
work out why the kexec/kdump kernel loading position differs between the
two interfaces in order to rule out unnecessary noise. Hence the two
interfaces need to be unified in behaviour.

Add a check of kexec_buf.top_down in arch_kexec_walk_mem(); if it is
'true', call the newly added walk_system_ram_res_rev() to find a memory
region from top to bottom in which to load the kernel.
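
For context, a minimal sketch of how a caller requests top-down
placement (field names assume struct kexec_buf as declared in
include/linux/kexec.h of this kernel generation; 'image' and 'ret'
are placeholders):

	struct kexec_buf kbuf = {
		.image     = image,	/* kimage being loaded */
		.buf_align = PAGE_SIZE,
		.buf_min   = 0,
		.buf_max   = ULONG_MAX,
		.top_down  = true,	/* ask for the highest suitable hole */
	};
	int ret;

	/* with this patch, the System RAM walk itself also runs top-down */
	ret = kexec_add_buffer(&kbuf);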

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kexec@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c6a3b6851372..75226c1d08ce 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 					   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
 					   crashk_res.start, crashk_res.end,
 					   kbuf, func);
+	else if (kbuf->top_down)
+		return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
 	else
 		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
-- 
2.13.6


* Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
  2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
@ 2018-07-18 16:36   ` Andy Shevchenko
  2018-07-18 16:37     ` Andy Shevchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Shevchenko @ 2018-07-18 16:36 UTC (permalink / raw)
  To: Baoquan He
  Cc: Nicolas Pitre, brijesh.singh, devicetree, David Airlie,
	linux-pci, richard.weiyang, Max Filippov, Paul Mackerras,
	baiyaowei, KY Srinivasan, Frank Rowand, Lorenzo Pieralisi,
	Stephen Hemminger, linux-nvdimm, Michael Ellerman,
	Patrik Jakobsson, linux-input, Gustavo Padovan, Borislav Petkov,
	Dave Young, Tom Lendacky, Haiyang Zhang, Maarten Lankhorst,
	Josh Triplett, Jérôme Glisse, Rob Herring, Sean Paul,
	Bjorn Helgaas, Thomas Gleixner, Yinghai Lu, Jon Derrick,
	Chris Zankel, Michal Simek, linux-parisc, Greg Kroah-Hartman,
	Dmitry Torokhov, Linux Kernel Mailing List,
	Benjamin Herrenschmidt, Eric Biederman, devel, Andrew Morton,
	kbuild test robot,
	open list:LINUX FOR POWERPC PA SEMI PWRFICIENT, David S. Miller

On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He <bhe@redhat.com> wrote:
> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> so that it's shared.

Some minor stuff.

> +/**
> + * reparent_resources - reparent resource children of parent that res covers
> + * @parent: parent resource descriptor
> + * @res: resource descriptor desired by caller
> + *
> + * Returns 0 on success, -ENOTSUPP if child resource is not completely
> + * contained by 'res', -ECANCELED if no any conflicting entry found.

'res' -> @res

> + *
> + * Reparent resource children of 'parent' that conflict with 'res'

Ditto + 'parent' -> @parent

> + * under 'res', and make 'res' replace those children.

Ditto.

> + */
> +int reparent_resources(struct resource *parent, struct resource *res)
> +{
> +       struct resource *p, **pp;
> +       struct resource **firstpp = NULL;
> +
> +       for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> +               if (p->end < res->start)
> +                       continue;
> +               if (res->end < p->start)
> +                       break;
> +               if (p->start < res->start || p->end > res->end)
> +                       return -ENOTSUPP;       /* not completely contained */
> +               if (firstpp == NULL)
> +                       firstpp = pp;
> +       }
> +       if (firstpp == NULL)
> +               return -ECANCELED; /* didn't find any conflicting entries? */
> +       res->parent = parent;
> +       res->child = *firstpp;
> +       res->sibling = *pp;
> +       *firstpp = res;
> +       *pp = NULL;
> +       for (p = res->child; p != NULL; p = p->sibling) {
> +               p->parent = res;

> +               pr_debug("PCI: Reparented %s %pR under %s\n",
> +                        p->name, p, res->name);

Now that this is generic code, the "PCI: " prefix is a bit confusing here.

> +       }
> +       return 0;
> +}
> +EXPORT_SYMBOL(reparent_resources);
> +
>  static void __init __reserve_region_with_split(struct resource *root,
>                 resource_size_t start, resource_size_t end,
>                 const char *name)
> --
> 2.13.6
>



-- 
With Best Regards,
Andy Shevchenko
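
A fix along the lines Andy suggests above would simply drop the
subsystem prefix once the function lives in generic code (illustrative):

	pr_debug("Reparented %s %pR under %s\n",
		 p->name, p, res->name);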

* Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
  2018-07-18 16:36   ` Andy Shevchenko
@ 2018-07-18 16:37     ` Andy Shevchenko
  2018-07-19 15:18       ` Baoquan He
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Shevchenko @ 2018-07-18 16:37 UTC (permalink / raw)
  To: Baoquan He
  Cc: Nicolas Pitre, brijesh.singh, devicetree, David Airlie,
	linux-pci, richard.weiyang, Max Filippov, Paul Mackerras,
	baiyaowei, KY Srinivasan, Frank Rowand, Lorenzo Pieralisi,
	Stephen Hemminger, linux-nvdimm, Michael Ellerman,
	Patrik Jakobsson, linux-input, Gustavo Padovan, Borislav Petkov,
	Dave Young, Tom Lendacky, Haiyang Zhang, Maarten Lankhorst,
	Josh Triplett, Jérôme Glisse, Rob Herring, Sean Paul,
	Bjorn Helgaas, Thomas Gleixner, Yinghai Lu, Jon Derrick,
	Chris Zankel, Michal Simek, linux-parisc, Greg Kroah-Hartman,
	Dmitry Torokhov, Linux Kernel Mailing List,
	Benjamin Herrenschmidt, Eric Biederman, devel, Andrew Morton,
	kbuild test robot,
	open list:LINUX FOR POWERPC PA SEMI PWRFICIENT, David S. Miller

On Wed, Jul 18, 2018 at 7:36 PM, Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He <bhe@redhat.com> wrote:
>> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
>> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
>> so that it's shared.

>> + * Returns 0 on success, -ENOTSUPP if child resource is not completely
>> + * contained by 'res', -ECANCELED if no any conflicting entry found.

You can also refer to constants by prefixing them with %, e.g. %-ENOTSUPP.
But this is entirely up to you.

-- 
With Best Regards,
Andy Shevchenko
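
Taken together, the two rounds of comments would turn the kerneldoc
into something like this (a sketch of the suggested markup, not the
final wording):

/**
 * reparent_resources - reparent resource children of parent that res covers
 * @parent: parent resource descriptor
 * @res: resource descriptor desired by caller
 *
 * Returns 0 on success, %-ENOTSUPP if a child resource is not completely
 * contained by @res, %-ECANCELED if no conflicting entry is found.
 *
 * Reparent resource children of @parent that conflict with @res
 * under @res, and make @res replace those children.
 */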

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-18  2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He
@ 2018-07-18 22:33   ` Andrew Morton
  2018-07-19 15:17     ` Baoquan He
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2018-07-18 22:33 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list, tglx,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, yinghai, jonathan.derrick, chris,
	monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, fengguang.wu, linuxppc-dev, davem

On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:

> For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> is used to load kernel/initrd/purgatory is supposed to be allocated from
> top to down. This is what we have been doing all along in the old kexec
> loading interface and the kexec loading is still default setting in some
> distributions. However, the current kexec_file loading interface doesn't
> do like this. The function arch_kexec_walk_mem() it calls ignores checking
> kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> all resources of System RAM from bottom to up, to try to find memory region
> which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> to allocate memory in that found memory region from top to down. This brings
> confusion especially when KASLR is widely supported , users have to make clear
> why kexec/kdump kernel loading position is different between these two
> interfaces in order to exclude unnecessary noises. Hence these two interfaces
> need be unified on behaviour.

As far as I can tell, the above is the whole reason for the patchset,
yes?  To avoid confusing users.

Is that sufficient?  Can we instead simplify their lives by providing
better documentation or informative printks or better Kconfig text,
etc?

And who *are* the people who are performing this configuration?  Random
system administrators?  Linux distro engineers?  If the latter then
they presumably aren't easily confused!

In other words, I'm trying to understand how much benefit this patchset
will provide to our users as a whole.


* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-18 22:33   ` Andrew Morton
@ 2018-07-19 15:17     ` Baoquan He
  2018-07-19 19:44       ` Andrew Morton
  2018-07-23 14:34       ` Michal Hocko
  0 siblings, 2 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-19 15:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list, tglx,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, yinghai, jonathan.derrick, chris,
	monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, fengguang.wu, linuxppc-dev, davem

Hi Andrew,

On 07/18/18 at 03:33pm, Andrew Morton wrote:
> On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:
> 
> > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > top to down. This is what we have been doing all along in the old kexec
> > loading interface and the kexec loading is still default setting in some
> > distributions. However, the current kexec_file loading interface doesn't
> > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > all resources of System RAM from bottom to up, to try to find memory region
> > which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> > to allocate memory in that found memory region from top to down. This brings
> > confusion especially when KASLR is widely supported , users have to make clear
> > why kexec/kdump kernel loading position is different between these two
> > interfaces in order to exclude unnecessary noises. Hence these two interfaces
> > need be unified on behaviour.
> 
> As far as I can tell, the above is the whole reason for the patchset,
> yes?  To avoid confusing users.


In fact, it's not just about avoiding user confusion. Kexec loading
and kexec_file loading do the same thing in essence; kexec_file exists
because we need to verify the kernel image on UEFI systems, so the
kexec loading code had to be ported into the kernel.

Kexec has been a formal feature in our distro, and customers owning
very large machines can use it to speed up the reboot process. On UEFI
machines, kexec_file loading searches for a place to put the kernel
under 4G, from top to bottom. As we know, the first 4G is the DMA32
zone, and DMA, PCI mmcfg, BIOS, etc. all try to consume it, so there
is a chance that no usable space for the kernel/initrd can be found
there. Searching downwards from the top of the whole memory space, we
don't have this worry.

At the first post, I posted the version below with AKASHI's
walk_system_ram_res_rev(). Later you suggested using list_head to link
a resource's child siblings, to see what the code change would look
like:
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Then I posted v2:
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
Rob Herring mentioned that other components with this tree structure
have planned to do the same thing, replacing the singly linked list
with list_head to link a resource's child siblings. Quoting Rob's
words below; I think this could be another reason.

~~~~~ From Rob
The DT struct device_node also has the same tree structure with
parent, child, sibling pointers and converting to list_head had been
on the todo list for a while. ACPI also has some tree walking
functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
common tree struct and helpers defined either on top of list_head or a
new struct if that saves some size.
~~~~~

> 
> Is that sufficient?  Can we instead simplify their lives by providing
> better documentation or informative printks or better Kconfig text,
> etc?
> 
> And who *are* the people who are performing this configuration?  Random
> system administrators?  Linux distro engineers?  If the latter then
> they presumably aren't easily confused!

Kexec was invented to let kernel developers speed up kernel
rebooting. Nowadays high-end server admins, kernel developers and QE
are also keen to use it to reboot large boxes for faster feature
testing and bug debugging. Kernel developers may know the kernel
loading position well; admins or QE might not be aware of it.

> 
> In other words, I'm trying to understand how much benefit this patchset
> will provide to our users as a whole.

Understood. The list_head replacement patch truly involves a lot of
code change, so it's risky. I am willing to try any idea from
reviewers, and won't insist that it be accepted in the end; without
trying, we don't know what it looks like or what impact it may have. I
am fine with taking AKASHI's simple version of
walk_system_ram_res_rev() to lower the risk, even though it could be a
little less efficient.

Thanks
Baoquan

* Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
  2018-07-18 16:37     ` Andy Shevchenko
@ 2018-07-19 15:18       ` Baoquan He
  0 siblings, 0 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-19 15:18 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Nicolas Pitre, brijesh.singh, devicetree, David Airlie,
	linux-pci, richard.weiyang, Max Filippov, Paul Mackerras,
	baiyaowei, KY Srinivasan, Frank Rowand, Lorenzo Pieralisi,
	Stephen Hemminger, linux-nvdimm, Michael Ellerman,
	Patrik Jakobsson, linux-input, Gustavo Padovan, Borislav Petkov,
	Dave Young, Tom Lendacky, Haiyang Zhang, Maarten Lankhorst,
	Josh Triplett, Jérôme Glisse, Rob Herring, Sean Paul,
	Bjorn Helgaas, Thomas Gleixner, Yinghai Lu, Jon Derrick,
	Chris Zankel, Michal Simek, linux-parisc, Greg Kroah-Hartman,
	Dmitry Torokhov, Linux Kernel Mailing List,
	Benjamin Herrenschmidt, Eric Biederman, devel, Andrew Morton,
	kbuild test robot,
	open list:LINUX FOR POWERPC PA SEMI PWRFICIENT, David S. Miller

On 07/18/18 at 07:37pm, Andy Shevchenko wrote:
> On Wed, Jul 18, 2018 at 7:36 PM, Andy Shevchenko
> <andy.shevchenko@gmail.com> wrote:
> > On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He <bhe@redhat.com> wrote:
> >> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> >> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> >> so that it's shared.
> 
> >> + * Returns 0 on success, -ENOTSUPP if child resource is not completely
> >> + * contained by 'res', -ECANCELED if no any conflicting entry found.
> 
> You also can refer to constants by prefixing them with %, e.g. %-ENOTSUPP.
> But this is up to you completely.

Thanks, I will fix these when I repost.


* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-19 15:17     ` Baoquan He
@ 2018-07-19 19:44       ` Andrew Morton
  2018-07-25  2:21         ` Baoquan He
  2018-07-23 14:34       ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2018-07-19 19:44 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list, tglx,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, yinghai, jonathan.derrick, chris,
	monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, fengguang.wu, linuxppc-dev, davem

On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote:

> Hi Andrew,
> 
> On 07/18/18 at 03:33pm, Andrew Morton wrote:
> > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:
> > 
> > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > > top to down. This is what we have been doing all along in the old kexec
> > > loading interface and the kexec loading is still default setting in some
> > > distributions. However, the current kexec_file loading interface doesn't
> > > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > > all resources of System RAM from bottom to up, to try to find memory region
> > > which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> > > to allocate memory in that found memory region from top to down. This brings
> > > confusion especially when KASLR is widely supported , users have to make clear
> > > why kexec/kdump kernel loading position is different between these two
> > > interfaces in order to exclude unnecessary noises. Hence these two interfaces
> > > need be unified on behaviour.
> > 
> > As far as I can tell, the above is the whole reason for the patchset,
> > yes?  To avoid confusing users.
> 
> 
> In fact, it's not just trying to avoid confusing users. Kexec loading
> and kexec_file loading are just do the same thing in essence. Just we
> need do kernel image verification on uefi system, have to port kexec
> loading code to kernel. 
> 
> Kexec has been a formal feature in our distro, and customers owning
> those kind of very large machine can make use of this feature to speed
> up the reboot process. On uefi machine, the kexec_file loading will
> search place to put kernel under 4G from top to down. As we know, the
> 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> it. It may have possibility to not be able to find a usable space for
> kernel/initrd. From the top down of the whole memory space, we don't
> have this worry. 
> 
> And at the first post, I just posted below with AKASHI's
> walk_system_ram_res_rev() version. Later you suggested to use
> list_head to link child sibling of resource, see what the code change
> looks like.
> http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com
> 
> Then I posted v2
> http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
> Rob Herring mentioned that other components which has this tree struct
> have planned to do the same thing, replacing the singly linked list with
> list_head to link resource child sibling. Just quote Rob's words as
> below. I think this could be another reason.
> 
> ~~~~~ From Rob
> The DT struct device_node also has the same tree structure with
> parent, child, sibling pointers and converting to list_head had been
> on the todo list for a while. ACPI also has some tree walking
> functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
> common tree struct and helpers defined either on top of list_head or a
> new struct if that saves some size.
> ~~~~~

Please let's get all this into the changelogs?

> > 
> > Is that sufficient?  Can we instead simplify their lives by providing
> > better documentation or informative printks or better Kconfig text,
> > etc?
> > 
> > And who *are* the people who are performing this configuration?  Random
> > system administrators?  Linux distro engineers?  If the latter then
> > they presumably aren't easily confused!
> 
> Kexec was invented for kernel developer to speed up their kernel
> rebooting. Now high end sever admin, kernel developer and QE are also
> keen to use it to reboot large box for faster feature testing, bug
> debugging. Kernel dev could know this well, about kernel loading
> position, admin or QE might not be aware of it very well. 
> 
> > 
> > In other words, I'm trying to understand how much benefit this patchset
> > will provide to our users as a whole.
> 
> Understood. The list_head replacing patch truly involes too many code
> changes, it's risky. I am willing to try any idea from reviewers, won't
> persuit they have to be accepted finally. If don't have a try, we don't
> know what it looks like, and what impact it may have. I am fine to take
> AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
> though it could be a little bit low efficient.

The larger patch produces a better result.  We can handle it ;)

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-19 15:17     ` Baoquan He
  2018-07-19 19:44       ` Andrew Morton
@ 2018-07-23 14:34       ` Michal Hocko
  2018-07-25  6:48         ` Baoquan He
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-07-23 14:34 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On Thu 19-07-18 23:17:53, Baoquan He wrote:
> Kexec has been a formal feature in our distro, and customers owning
> those kind of very large machine can make use of this feature to speed
> up the reboot process. On uefi machine, the kexec_file loading will
> search place to put kernel under 4G from top to down. As we know, the
> 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> it. It may have possibility to not be able to find a usable space for
> kernel/initrd. From the top down of the whole memory space, we don't
> have this worry. 

I do not have the full context here, but let me note that you should
be careful when doing top-down reservations, because you can easily
end up in hotpluggable memory and break the hotremove use case. We
even warn when this is done; see memblock_find_in_range_node().
-- 
Michal Hocko
SUSE Labs
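
For reference, the logic Michal points to sits in mm/memblock.c's
memblock_find_in_range_node(); roughly, paraphrased from the 4.18-era
source (the helper names are real, the warning text is from memory and
may differ):

/* Bottom-up is only tried above the kernel image, which keeps
 * allocations out of (typically high) hotpluggable memory; a fallback
 * to top-down is worth warning about.
 */
if (memblock_bottom_up() && end > kernel_end) {
	phys_addr_t bottom_up_start = max(start, kernel_end);

	ret = __memblock_find_range_bottom_up(bottom_up_start, end,
					      size, align, nid, flags);
	if (ret)
		return ret;

	WARN_ONCE(1, "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
}

return __memblock_find_range_top_down(start, end, size, align, nid, flags);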

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-19 19:44       ` Andrew Morton
@ 2018-07-25  2:21         ` Baoquan He
  0 siblings, 0 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-25  2:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list, tglx,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, yinghai, jonathan.derrick, chris,
	monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, fengguang.wu, linuxppc-dev, davem

Hi Andrew,

On 07/19/18 at 12:44pm, Andrew Morton wrote:
> On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote:
> > > As far as I can tell, the above is the whole reason for the patchset,
> > > yes?  To avoid confusing users.
> > 
> > 
> > In fact, it's not just trying to avoid confusing users. Kexec loading
> > and kexec_file loading are just do the same thing in essence. Just we
> > need do kernel image verification on uefi system, have to port kexec
> > loading code to kernel. 
> > 
> > Kexec has been a formal feature in our distro, and customers owning
> > those kind of very large machine can make use of this feature to speed
> > up the reboot process. On uefi machine, the kexec_file loading will
> > search place to put kernel under 4G from top to down. As we know, the
> > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > it. It may have possibility to not be able to find a usable space for
> > kernel/initrd. From the top down of the whole memory space, we don't
> > have this worry. 
> > 
> > And at the first post, I just posted below with AKASHI's
> > walk_system_ram_res_rev() version. Later you suggested to use
> > list_head to link child sibling of resource, see what the code change
> > looks like.
> > http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com
> > 
> > Then I posted v2
> > http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
> > Rob Herring mentioned that other components which has this tree struct
> > have planned to do the same thing, replacing the singly linked list with
> > list_head to link resource child sibling. Just quote Rob's words as
> > below. I think this could be another reason.
> > 
> > ~~~~~ From Rob
> > The DT struct device_node also has the same tree structure with
> > parent, child, sibling pointers and converting to list_head had been
> > on the todo list for a while. ACPI also has some tree walking
> > functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
> > common tree struct and helpers defined either on top of list_head or a
> > new struct if that saves some size.
> > ~~~~~
> 
> Please let's get all this into the changelogs?

Sorry for the late reply; I was handling some urgent customer hotplug
issues.

I am rewriting all the changelogs and the cover letter, and found that
I was wrong about the second reason. The current kexec_file_load calls
kexec_locate_mem_hole() to go through all System RAM regions; if a
region is larger than the size of the kernel or initrd, it searches
for a position inside that region from top to bottom. Since kexec
jumps into the 2nd kernel and doesn't need to care about the 1st
kernel's data, we can always find usable space to load the kexec
kernel/initrd under 4G.

So the only reason for this patch is to keep consistent with
kexec_load and avoid confusion.
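
The per-region top-down behaviour described above comes from the
callback that kexec_file passes to the walker. Roughly, from the
kernel/kexec_file.c of this era (reconstructed, details may differ):

static int locate_mem_hole_callback(struct resource *res, void *arg)
{
	struct kexec_buf *kbuf = (struct kexec_buf *)arg;
	u64 start = res->start, end = res->end;
	unsigned long sz = end - start + 1;

	/* Region too small for the segment: move on to the next one */
	if (sz < kbuf->memsz)
		return 0;

	/* Region entirely outside the allowed [buf_min, buf_max] window */
	if (end < kbuf->buf_min || start > kbuf->buf_max)
		return 0;

	/*
	 * Within one System RAM region the hole is located top-down or
	 * bottom-up as requested; the walk *across* regions is what
	 * patch 4/4 changes for the top_down case.
	 */
	if (kbuf->top_down)
		return locate_mem_hole_top_down(start, end, kbuf);
	return locate_mem_hole_bottom_up(start, end, kbuf);
}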

And since x86 5-level paging mode has been added, we have another
issue with top-down searching across the whole of System RAM: we
support dynamic switching from 5-level to 4-level, i.e. a kernel
compiled with 5-level support can be forced to 4-level with 'no5lvl'.
Jumping from a 5-level kernel to a 4-level kernel, we might load the
kernel at the top of System RAM in 5-level paging mode, above 64TB,
and then try to jump into a 4-level kernel whose upper limit is 64TB.
For this case, we would need to add a limit on where the kexec kernel
is loaded when running in a 5-level kernel (see the sketch below).

All this mess makes me hesitant to settle on an approach. Maybe I
should drop this patchset.
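
A hypothetical guard for that case, not part of this series, could cap
the search window before walking memory; MAX_PHYSMEM_4LVL and where the
check would live are assumptions for illustration only:

/* 4-level x86_64 paging addresses at most 64TB of physical memory */
#define MAX_PHYSMEM_4LVL	(1ULL << 46)

/* Hypothetical: keep a top-down kexec search below the 4-level limit
 * so a 5-level kernel never places the next kernel above the 64TB a
 * 4-level target kernel can reach.
 */
if (kbuf->top_down)
	kbuf->buf_max = min_t(u64, kbuf->buf_max, MAX_PHYSMEM_4LVL - 1);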

> 
> > > 
> > > Is that sufficient?  Can we instead simplify their lives by providing
> > > better documentation or informative printks or better Kconfig text,
> > > etc?
> > > 
> > > And who *are* the people who are performing this configuration?  Random
> > > system administrators?  Linux distro engineers?  If the latter then
> > > they presumably aren't easily confused!
> > 
> > Kexec was invented for kernel developer to speed up their kernel
> > rebooting. Now high end sever admin, kernel developer and QE are also
> > keen to use it to reboot large box for faster feature testing, bug
> > debugging. Kernel dev could know this well, about kernel loading
> > position, admin or QE might not be aware of it very well. 
> > 
> > > 
> > > In other words, I'm trying to understand how much benefit this patchset
> > > will provide to our users as a whole.
> > 
> > Understood. The list_head replacing patch truly involes too many code
> > changes, it's risky. I am willing to try any idea from reviewers, won't
> > persuit they have to be accepted finally. If don't have a try, we don't
> > know what it looks like, and what impact it may have. I am fine to take
> > AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
> > though it could be a little bit low efficient.
> 
> The larger patch produces a better result.  We can handle it ;)

For this issue, if we stop changing the kexec top-down searching code,
I am not sure whether the list_head replacement patches should still
be posted separately.

Thanks
Baoquan

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-23 14:34       ` Michal Hocko
@ 2018-07-25  6:48         ` Baoquan He
  2018-07-26 12:59           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Baoquan He @ 2018-07-25  6:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On 07/23/18 at 04:34pm, Michal Hocko wrote:
> On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > Kexec has been a formal feature in our distro, and customers owning
> > those kind of very large machine can make use of this feature to speed
> > up the reboot process. On uefi machine, the kexec_file loading will
> > search place to put kernel under 4G from top to down. As we know, the
> > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > it. It may have possibility to not be able to find a usable space for
> > kernel/initrd. From the top down of the whole memory space, we don't
> > have this worry. 
> 
> I do not have the full context here but let me note that you should be
> careful when doing top-down reservation because you can easily get into
> hotplugable memory and break the hotremove usecase. We even warn when
> this is done. See memblock_find_in_range_node

Kexec reads the kernel/initrd files into a buffer and just searches
for usable positions to do the later copying. You can see struct
kexec_segment below: for the old kexec_load, kernel/initrd are read
into a user space buffer, @buf stores that user space buffer address,
and @mem stores the position where kernel/initrd will finally be put.
In the kernel, kimage_load_normal_segment() copies the user space
buffer into intermediate pages allocated with the GFP_KERNEL flag.
These intermediate pages are recorded as entries; later, when the user
executes "kexec -e" to trigger the kexec jump, the final copy is done
from the intermediate pages to the real destination pages that @mem
points to, because we must not touch the existing data of the 1st
kernel while loading the kexec kernel. To my understanding, GFP_KERNEL
makes those intermediate pages be allocated inside the non-movable
area, so they won't impact hotplugging. But the @mem we searched for
across the whole of System RAM might be lost along with a hotplug;
hence we need to load the kexec kernel again when a hotplug event is
detected.

#define KEXEC_CONTROL_MEMORY_GFP (GFP_KERNEL | __GFP_NORETRY)


struct kexec_segment {
        /*
         * This pointer can point to user memory if kexec_load() system
         * call is used or will point to kernel memory if
         * kexec_file_load() system call is used.
         *
         * Use ->buf when expecting to deal with user memory and use ->kbuf
         * when expecting to deal with kernel memory.
         */
        union {
                void __user *buf;
                void *kbuf;
        };
        size_t bufsz;
        unsigned long mem;
        size_t memsz;
};

Thanks
Baoquan

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-25  6:48         ` Baoquan He
@ 2018-07-26 12:59           ` Michal Hocko
  2018-07-26 13:09             ` Baoquan He
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-07-26 12:59 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On Wed 25-07-18 14:48:13, Baoquan He wrote:
> On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > Kexec has been a formal feature in our distro, and customers owning
> > > those kind of very large machine can make use of this feature to speed
> > > up the reboot process. On uefi machine, the kexec_file loading will
> > > search place to put kernel under 4G from top to down. As we know, the
> > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > it. It may have possibility to not be able to find a usable space for
> > > kernel/initrd. From the top down of the whole memory space, we don't
> > > have this worry. 
> > 
> > I do not have the full context here but let me note that you should be
> > careful when doing top-down reservation because you can easily get into
> > hotplugable memory and break the hotremove usecase. We even warn when
> > this is done. See memblock_find_in_range_node
> 
> Kexec read kernel/initrd file into buffer, just search usable positions
> for them to do the later copying. You can see below struct kexec_segment, 
> for the old kexec_load, kernel/initrd are read into user space buffer,
> the @buf stores the user space buffer address, @mem stores the position
> where kernel/initrd will be put. In kernel, it calls
> kimage_load_normal_segment() to copy user space buffer to intermediate
> pages which are allocated with flag GFP_KERNEL. These intermediate pages
> are recorded as entries, later when user execute "kexec -e" to trigger
> kexec jumping, it will do the final copying from the intermediate pages
> to the real destination pages which @mem pointed. Because we can't touch
> the existed data in 1st kernel when do kexec kernel loading. With my
> understanding, GFP_KERNEL will make those intermediate pages be
> allocated inside immovable area, it won't impact hotplugging. But the
> @mem we searched in the whole system RAM might be lost along with
> hotplug. Hence we need do kexec kernel again when hotplug event is
> detected.

I am not sure I am following. If @mem is placed on a movable node,
then memory hotremove simply won't work, because we see reserved pages
and do not know what to do about them; they are not migratable.
Allocating the intermediate pages from other nodes doesn't really
help.

The memblock code warns exactly for that reason.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-26 12:59           ` Michal Hocko
@ 2018-07-26 13:09             ` Baoquan He
  2018-07-26 13:12               ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Baoquan He @ 2018-07-26 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On 07/26/18 at 02:59pm, Michal Hocko wrote:
> On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > Kexec has been a formal feature in our distro, and customers owning
> > > > those kind of very large machine can make use of this feature to speed
> > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > search place to put kernel under 4G from top to down. As we know, the
> > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > it. It may have possibility to not be able to find a usable space for
> > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > have this worry. 
> > > 
> > > I do not have the full context here but let me note that you should be
> > > careful when doing top-down reservation because you can easily get into
> > > hotplugable memory and break the hotremove usecase. We even warn when
> > > this is done. See memblock_find_in_range_node
> > 
> > Kexec read kernel/initrd file into buffer, just search usable positions
> > for them to do the later copying. You can see below struct kexec_segment, 
> > for the old kexec_load, kernel/initrd are read into user space buffer,
> > the @buf stores the user space buffer address, @mem stores the position
> > where kernel/initrd will be put. In kernel, it calls
> > kimage_load_normal_segment() to copy user space buffer to intermediate
> > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > are recorded as entries, later when user execute "kexec -e" to trigger
> > kexec jumping, it will do the final copying from the intermediate pages
> > to the real destination pages which @mem pointed. Because we can't touch
> > the existed data in 1st kernel when do kexec kernel loading. With my
> > understanding, GFP_KERNEL will make those intermediate pages be
> > allocated inside immovable area, it won't impact hotplugging. But the
> > @mem we searched in the whole system RAM might be lost along with
> > hotplug. Hence we need do kexec kernel again when hotplug event is
> > detected.
> 
> I am not sure I am following. If @mem is placed at movable node then the
> memory hotremove simply won't work, because we are seeing reserved pages
> and do not know what to do about them. They are not migrateable.
> Allocating intermediate pages from other nodes doesn't really help.

OK, I forgot about the 2nd kernel that kexec jumps into. It won't
impact hotremove in the 1st kernel, but it does impact the kernel that
kexec jumps into, if the kernel is at the top of System RAM and that
RAM is in a movable node.

> 
> The memblock code warns exactly for that reason.
> -- 
> Michal Hocko
> SUSE Labs

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-26 13:09             ` Baoquan He
@ 2018-07-26 13:12               ` Michal Hocko
  2018-07-26 13:14                 ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-07-26 13:12 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On Thu 26-07-18 21:09:04, Baoquan He wrote:
> On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > those kind of very large machine can make use of this feature to speed
> > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > search place to put kernel under 4G from top to down. As we know, the
> > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > > it. It may have possibility to not be able to find a usable space for
> > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > have this worry. 
> > > > 
> > > > I do not have the full context here but let me note that you should be
> > > > careful when doing top-down reservation because you can easily get into
> > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > this is done. See memblock_find_in_range_node
> > > 
> > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > for them to do the later copying. You can see below struct kexec_segment, 
> > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > the @buf stores the user space buffer address, @mem stores the position
> > > where kernel/initrd will be put. In kernel, it calls
> > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > kexec jumping, it will do the final copying from the intermediate pages
> > > to the real destination pages which @mem pointed. Because we can't touch
> > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > understanding, GFP_KERNEL will make those intermediate pages be
> > > allocated inside immovable area, it won't impact hotplugging. But the
> > > @mem we searched in the whole system RAM might be lost along with
> > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > detected.
> > 
> > I am not sure I am following. If @mem is placed at movable node then the
> > memory hotremove simply won't work, because we are seeing reserved pages
> > and do not know what to do about them. They are not migrateable.
> > Allocating intermediate pages from other nodes doesn't really help.
> 
> OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove
> in 1st kernel, it does impact the kernel which kexec jump into if kernel
> is at top of system RAM and the top RAM is in movable node.

It will affect the 1st kernel (which does the memblock allocation
top-down) as well, for the reasons mentioned above.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-26 13:12               ` Michal Hocko
@ 2018-07-26 13:14                 ` Michal Hocko
  2018-07-26 13:37                   ` Baoquan He
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-07-26 13:14 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > > those kind of very large machine can make use of this feature to speed
> > > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > > search place to put kernel under 4G from top to down. As we know, the
> > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > > > it. It may have possibility to not be able to find a usable space for
> > > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > > have this worry. 
> > > > > 
> > > > > I do not have the full context here but let me note that you should be
> > > > > careful when doing top-down reservation because you can easily get into
> > > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > > this is done. See memblock_find_in_range_node
> > > > 
> > > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > > for them to do the later copying. You can see below struct kexec_segment, 
> > > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > > the @buf stores the user space buffer address, @mem stores the position
> > > > where kernel/initrd will be put. In kernel, it calls
> > > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > > kexec jumping, it will do the final copying from the intermediate pages
> > > > to the real destination pages which @mem pointed. Because we can't touch
> > > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > allocated inside immovable area, it won't impact hotplugging. But the
> > > > @mem we searched in the whole system RAM might be lost along with
> > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > detected.
> > > 
> > > I am not sure I am following. If @mem is placed at movable node then the
> > > memory hotremove simply won't work, because we are seeing reserved pages
> > > and do not know what to do about them. They are not migrateable.
> > > Allocating intermediate pages from other nodes doesn't really help.
> > 
> > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove
> > in 1st kernel, it does impact the kernel which kexec jump into if kernel
> > is at top of system RAM and the top RAM is in movable node.
> 
> It will affect the 1st kernel (which does the memblock allocation
> top-down) as well. For reasons mentioned above.

And btw., in an ideal world we would restrict top-down memblock
allocations to the non-movable nodes. But I do not think we have that
information ready at the time the reservation is done.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-26 13:14                 ` Michal Hocko
@ 2018-07-26 13:37                   ` Baoquan He
  2018-07-26 14:01                     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Baoquan He @ 2018-07-26 13:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On 07/26/18 at 03:14pm, Michal Hocko wrote:
> On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > > > those kind of very large machine can make use of this feature to speed
> > > > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > > > search place to put kernel under 4G from top to down. As we know, the
> > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > > > > it. It may have possibility to not be able to find a usable space for
> > > > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > > > have this worry. 
> > > > > > 
> > > > > > I do not have the full context here but let me note that you should be
> > > > > > careful when doing top-down reservation because you can easily get into
> > > > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > > > this is done. See memblock_find_in_range_node
> > > > > 
> > > > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > > > for them to do the later copying. You can see below struct kexec_segment, 
> > > > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > > > the @buf stores the user space buffer address, @mem stores the position
> > > > > where kernel/initrd will be put. In kernel, it calls
> > > > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > > > kexec jumping, it will do the final copying from the intermediate pages
> > > > > to the real destination pages which @mem pointed. Because we can't touch
> > > > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > > allocated inside immovable area, it won't impact hotplugging. But the
> > > > > @mem we searched in the whole system RAM might be lost along with
> > > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > > detected.
> > > > 
> > > > I am not sure I am following. If @mem is placed at movable node then the
> > > > memory hotremove simply won't work, because we are seeing reserved pages
> > > > and do not know what to do about them. They are not migrateable.
> > > > Allocating intermediate pages from other nodes doesn't really help.
> > > 
> > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove
> > > in 1st kernel, it does impact the kernel which kexec jump into if kernel
> > > is at top of system RAM and the top RAM is in movable node.
> > 
> > It will affect the 1st kernel (which does the memblock allocation
> > top-down) as well. For reasons mentioned above.
> 
> And btw. in the ideal world, we would restrict the memblock allocation
> top-down from the non-movable nodes. But I do not think we have that
> information ready at the time when the reservation is done.

Oh, you may be mixing kexec loading up with kdump kernel loading. For
the kdump kernel we need to reserve a memory region during boot with
the memblock allocator. For kexec loading we just operate after the
system is up and do not need to reserve any memory region at all; the
memory used to load them is handled in quite different ways.

Thanks
Baoquan

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-26 13:37                   ` Baoquan He
@ 2018-07-26 14:01                     ` Michal Hocko
  2018-07-26 15:10                       ` Baoquan He
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-07-26 14:01 UTC (permalink / raw)
  To: Baoquan He
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On Thu 26-07-18 21:37:05, Baoquan He wrote:
> On 07/26/18 at 03:14pm, Michal Hocko wrote:
> > On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > > > > those kind of very large machine can make use of this feature to speed
> > > > > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > > > > search place to put kernel under 4G from top to down. As we know, the
> > > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > > > > > it. It may have possibility to not be able to find a usable space for
> > > > > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > > > > have this worry. 
> > > > > > > 
> > > > > > > I do not have the full context here but let me note that you should be
> > > > > > > careful when doing top-down reservation because you can easily get into
> > > > > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > > > > this is done. See memblock_find_in_range_node
> > > > > > 
> > > > > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > > > > for them to do the later copying. You can see below struct kexec_segment, 
> > > > > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > > > > the @buf stores the user space buffer address, @mem stores the position
> > > > > > where kernel/initrd will be put. In kernel, it calls
> > > > > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > > > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > > > > kexec jumping, it will do the final copying from the intermediate pages
> > > > > > to the real destination pages which @mem pointed. Because we can't touch
> > > > > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > > > allocated inside immovable area, it won't impact hotplugging. But the
> > > > > > @mem we searched in the whole system RAM might be lost along with
> > > > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > > > detected.
> > > > > 
> > > > > I am not sure I am following. If @mem is placed at movable node then the
> > > > > memory hotremove simply won't work, because we are seeing reserved pages
> > > > > and do not know what to do about them. They are not migrateable.
> > > > > Allocating intermediate pages from other nodes doesn't really help.
> > > > 
> > > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove
> > > > in 1st kernel, it does impact the kernel which kexec jump into if kernel
> > > > is at top of system RAM and the top RAM is in movable node.
> > > 
> > > It will affect the 1st kernel (which does the memblock allocation
> > > top-down) as well. For reasons mentioned above.
> > 
> > And btw., in an ideal world we would restrict top-down memblock
> > allocation to the non-movable nodes. But I do not think we have that
> > information ready at the time the reservation is done.
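> > 
> > For illustration only, with the memblock API we have today, such a
> > check could look roughly like the sketch below;
> > is_range_hotpluggable() is a made-up helper name, not existing code:
> > 
> > 	#include <linux/memblock.h>
> > 
> > 	/* True if [start, end) overlaps any memblock region that the
> > 	 * firmware (SRAT) marked MEMBLOCK_HOTPLUG. */
> > 	static bool is_range_hotpluggable(phys_addr_t start, phys_addr_t end)
> > 	{
> > 		struct memblock_region *r;
> > 
> > 		for_each_memblock(memory, r) {
> > 			if (memblock_is_hotpluggable(r) &&
> > 			    start < r->base + r->size && end > r->base)
> > 				return true;
> > 		}
> > 		return false;
> > 	}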
> 
> Oh, you may be mixing up kexec loading with kdump kernel loading. For
> the kdump kernel, we need to reserve a memory region during boot with
> the memblock allocator. For kexec loading, we operate only after the
> system is up and do not need to reserve any memory region at all. The
> memory used to load the two is handled in quite different ways.
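> 
> Roughly, the two paths look like this (simplified from what x86 does;
> the constants and the callback are only indicative):
> 
> 	/* kdump: carve the crash region out at boot, before the page
> 	 * allocator exists, and keep it reserved permanently. */
> 	crash_base = memblock_find_in_range(CRASH_ALIGN, CRASH_ADDR_LOW_MAX,
> 					    crash_size, CRASH_ALIGN);
> 	memblock_reserve(crash_base, crash_size);
> 
> 	/* kexec: at run time, only search the iomem tree for a hole;
> 	 * nothing is reserved, and the final copy to @mem happens only
> 	 * when "kexec -e" actually jumps. */
> 	ret = walk_system_ram_res(0, ULONG_MAX, kbuf,
> 				  locate_mem_hole_callback);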

I didn't know about that. I thought both use the same underlying
reservation mechanism. My bad and sorry for the noise.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
  2018-07-26 14:01                     ` Michal Hocko
@ 2018-07-26 15:10                       ` Baoquan He
  0 siblings, 0 replies; 21+ messages in thread
From: Baoquan He @ 2018-07-26 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: nicolas.pitre, brijesh.singh, devicetree, airlied, linux-pci,
	richard.weiyang, jcmvbkbc, baiyaowei, kys, frowand.list,
	lorenzo.pieralisi, sthemmin, linux-nvdimm, patrik.r.jakobsson,
	andy.shevchenko, linux-input, gustavo, bp, dyoung,
	thomas.lendacky, haiyangz, maarten.lankhorst, josh, jglisse,
	robh+dt, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	linux-kernel, ebiederm, devel, Andrew Morton, fengguang.wu,
	linuxppc-dev, davem

On 07/26/18 at 04:01pm, Michal Hocko wrote:
> On Thu 26-07-18 21:37:05, Baoquan He wrote:
> > Oh, you may be mixing up kexec loading with kdump kernel loading. For
> > the kdump kernel, we need to reserve a memory region during boot with
> > the memblock allocator. For kexec loading, we operate only after the
> > system is up and do not need to reserve any memory region at all. The
> > memory used to load the two is handled in quite different ways.
> 
> I didn't know about that. I thought both use the same underlying
> reservation mechanism. My bad and sorry for the noise.

Not at all. It's truly confusing; I often need to take time to recall
the details myself.

end of thread [~2018-07-26 15:10 UTC]

Thread overview: 21+ messages
2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
2018-07-18 16:36   ` Andy Shevchenko
2018-07-18 16:37     ` Andy Shevchenko
2018-07-19 15:18       ` Baoquan He
2018-07-18  2:49 ` [PATCH v7 2/4] resource: Use list_head to link sibling resource Baoquan He
2018-07-18  2:49 ` [PATCH v7 3/4] resource: add walk_system_ram_res_rev() Baoquan He
2018-07-18  2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He
2018-07-18 22:33   ` Andrew Morton
2018-07-19 15:17     ` Baoquan He
2018-07-19 19:44       ` Andrew Morton
2018-07-25  2:21         ` Baoquan He
2018-07-23 14:34       ` Michal Hocko
2018-07-25  6:48         ` Baoquan He
2018-07-26 12:59           ` Michal Hocko
2018-07-26 13:09             ` Baoquan He
2018-07-26 13:12               ` Michal Hocko
2018-07-26 13:14                 ` Michal Hocko
2018-07-26 13:37                   ` Baoquan He
2018-07-26 14:01                     ` Michal Hocko
2018-07-26 15:10                       ` Baoquan He
