nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/7] device backed vmemmap crash dump support
@ 2024-03-06 10:28 Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 1/7] mm: memremap: register/unregister altmap region to a separate resource Li Zhijian
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

Hello folks,

Compared with the V2[1] I posted a long time ago, this time it is a
completely new proposal design.

### Background and motivate overview ###
---
Crash dump is an important feature for troubleshooting the kernel. It is the
final way to chase what happened at the kernel panic, slowdown, and so on. It
is one of the most important tools for customer support.

Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
configure the dumpable regions. Generally, (A)iomem resources registered with
flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
(B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
are dumpable.

The pmem use cases including fsdax and devdax, could map their vmemmap to
their own devices. In this case, these part of vmemmap will not be dumped when
crash happened since these regions are satisfied with neither the above (A)
nor (B).

In fsdax, the vmemmap(struct page array) becomes very important, it is one of
the key data to find status of reverse map. Lacking of the information may
cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
That means troubleshooters are unable to check more details about pmem from
the dumpfile.

### Proposal ###
---
In this proposal, register the device backed vmemmap as a separate resource.
This resource has its own new flag and name, and then teaches kexec_file_load(2)
and kexec_load(2) to mark it as dumpable.

Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
Proposed name: "Device Backed Vmemmap"

NOTE: crash-utils also needs to adapt to this new name for kexec_load()

With current proposal, the /proc/iomem should show as following for device
backed vmemmap
# cat /proc/iomem
...
fffc0000-ffffffff : Reserved
100000000-13fffffff : Persistent Memory
  100000000-10fffffff : namespace0.0
    100000000-1005fffff : Device Backed Vmemmap  # fsdax
a80000000-b7fffffff : CXL Window 0
  a80000000-affffffff : Persistent Memory
    a80000000-affffffff : region1
      a80000000-a811fffff : namespace1.0
        a80000000-a811fffff : Device Backed Vmemmap # devdax
      a81200000-abfffffff : dax1.0
b80000000-c7fffffff : CXL Window 1
c80000000-147fffffff : PCI Bus 0000:00
  c80000000-c801fffff : PCI Bus 0000:01
...

### Kdump service reloading ###
---
Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr should also
be updated. There are 2 approaches to make the reloading more efficient.
1) Use udev rules to watch CPU and memory events, then reload kdump
2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)

This reloading also needed when device backed vmemmap layouts change, Similar
to what 1) does now, one could add the following as the first lines to the
RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:

# namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
# devdax <-> system-ram updated: watch daxX.Y of dax
SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"

Regarding 2), my idea is that it would need to call the memory_notify() in
devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
hotplug. This part is not yet mature, but it does not affect the whole feature
because we can still use method 1) alternatively.

[1] https://lore.kernel.org/lkml/02066f0f-dbc0-0388-4233-8e24b6f8435b@fujitsu.com/T/
--------------------------------------------
changes from V2[1]
- new proposal design

CC: Alison Schofield <alison.schofield@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Baoquan He <bhe@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ira Weiny <ira.weiny@intel.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Vishal Verma <vishal.l.verma@intel.com>
CC: linux-cxl@vger.kernel.org
CC: linux-mm@kvack.org
CC: nvdimm@lists.linux.dev
CC: x86@kernel.org

Li Zhijian (7):
  mm: memremap: register/unregister altmap region to a separate resource
  mm: memremap: add pgmap_parent_resource() helper
  nvdimm: pmem: assign a parent resource for vmemmap region for the
    fsdax
  dax: pmem: assign a parent resource for vmemmap region for the devdax
  resource: Introduce walk device_backed_vmemmap res() helper
  x86/crash: make device backed vmemmap dumpable for kexec_file_load
  nvdimm: set force_raw=1 in kdump kernel

 arch/x86/kernel/crash.c         |  5 +++++
 drivers/dax/pmem.c              |  8 ++++++--
 drivers/nvdimm/namespace_devs.c |  3 +++
 drivers/nvdimm/pmem.c           |  9 ++++++---
 include/linux/ioport.h          |  4 ++++
 include/linux/memremap.h        |  4 ++++
 kernel/resource.c               | 13 +++++++++++++
 mm/memremap.c                   | 30 +++++++++++++++++++++++++++++-
 8 files changed, 70 insertions(+), 6 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 1/7] mm: memremap: register/unregister altmap region to a separate resource
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 2/7] mm: memremap: add pgmap_parent_resource() helper Li Zhijian
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

The elfcorehdr descirbes the dumpable region in PT_LOADs.

Generally, an iomem resource registered with flags
(IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) will be added to PT_LOADs by
kexe_file_load(2). An iomem resource with name prefix "System RAM" will
be added to PT_LOADs in kexec-tools by calling kexe_load(2).

So a simple way to make the altmap dumpable is to register altmap region
as a separate resource with the proper name and resource flags.

Here naming it as "Device Backed Vmemmap" plus resource flags
(IORESOURCE_DEVICE_BACKED_VMEMMAP and IORESOUCE_BUSY) to make it work first.

A /proc/iomem example is as following:
$ sudo cat /proc/iomem
...
fffc0000-ffffffff : Reserved
100000000-13fffffff : Persistent Memory
  100000000-10fffffff : namespace0.0
    100000000-1005fffff : Device Backed Vmemmap  # fsdax
a80000000-b7fffffff : CXL Window 0
  a80000000-affffffff : Persistent Memory
    a80000000-affffffff : region1
      a80000000-a811fffff : namespace1.0
        a80000000-a811fffff : Device Backed Vmemmap # devdax
      a81200000-abfffffff : dax1.0
b80000000-c7fffffff : CXL Window 1
c80000000-147fffffff : PCI Bus 0000:00
  c80000000-c801fffff : PCI Bus 0000:01
...

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Baoquan He <bhe@redhat.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: linux-mm@kvack.org
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h |  3 +++
 mm/memremap.c            | 23 ++++++++++++++++++++++-
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index db7fe25f3370..3b59e924f531 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -69,6 +69,7 @@ struct resource {
 #define IORESOURCE_UNSET	0x20000000	/* No address assigned yet */
 #define IORESOURCE_AUTO		0x40000000
 #define IORESOURCE_BUSY		0x80000000	/* Driver has marked this resource busy */
+#define IORESOURCE_DEVICE_BACKED_VMEMMAP 0xa0000000	/* device backed vmemmap resource */
 
 /* I/O resource extended types */
 #define IORESOURCE_SYSTEM_RAM		(IORESOURCE_MEM|IORESOURCE_SYSRAM)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 744c830f4b13..ca1f12353008 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,6 +17,8 @@ struct device;
  * @free: free pages set aside in the mapping for memmap storage
  * @align: pages reserved to meet allocation alignments
  * @alloc: track pages consumed, private to vmemmap_populate()
+ * @parent: the parent resource that altmap region belongs to
+ * @res: altmap region resource
  */
 struct vmem_altmap {
 	unsigned long base_pfn;
@@ -25,6 +27,7 @@ struct vmem_altmap {
 	unsigned long free;
 	unsigned long align;
 	unsigned long alloc;
+	struct resource *parent, *res;
 };
 
 /*
diff --git a/mm/memremap.c b/mm/memremap.c
index 9e9fb1972fff..78047157b0ee 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -157,7 +157,17 @@ EXPORT_SYMBOL_GPL(memunmap_pages);
 
 static void devm_memremap_pages_release(void *data)
 {
-	memunmap_pages(data);
+	struct dev_pagemap *pgmap = data;
+
+	if (pgmap->flags & PGMAP_ALTMAP_VALID && pgmap->altmap.res) {
+		resource_size_t start = pgmap->altmap.res->start;
+		resource_size_t size = pgmap->altmap.res->end -
+				       pgmap->altmap.res->start + 1;
+
+		__release_region(pgmap->altmap.parent, start, size);
+	}
+
+	memunmap_pages(pgmap);
 }
 
 static void dev_pagemap_percpu_release(struct percpu_ref *ref)
@@ -404,11 +414,22 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
 	int error;
 	void *ret;
+	struct vmem_altmap *altmap = &pgmap->altmap;
 
 	ret = memremap_pages(pgmap, dev_to_node(dev));
 	if (IS_ERR(ret))
 		return ret;
 
+	if (pgmap->flags & PGMAP_ALTMAP_VALID && altmap->parent) {
+		unsigned long start = altmap->base_pfn << PAGE_SHIFT;
+		unsigned long size = vmem_altmap_offset(altmap) << PAGE_SHIFT;
+		int flags = IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY;
+
+		altmap->res = __request_region(altmap->parent, start, size,
+					      "Device Backed Vmemmap", flags);
+		pr_debug("Insert a separate resource for altmap, %lx-%lx\n",
+			 start, start + size);
+	}
 	error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
 			pgmap);
 	if (error)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 2/7] mm: memremap: add pgmap_parent_resource() helper
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 1/7] mm: memremap: register/unregister altmap region to a separate resource Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax Li Zhijian
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

It's a helper to set the parent resource for the altmap of a given pgmap

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Baoquan He <bhe@redhat.com>
CC: linux-mm@kvack.org
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 include/linux/memremap.h | 1 +
 mm/memremap.c            | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index ca1f12353008..1e8b25352f7c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -193,6 +193,7 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
 void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
+void pgmap_parent_resource(struct dev_pagemap *pgmap, struct resource *res);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
diff --git a/mm/memremap.c b/mm/memremap.c
index 78047157b0ee..0bbf163d4817 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -390,6 +390,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 }
 EXPORT_SYMBOL_GPL(memremap_pages);
 
+void pgmap_parent_resource(struct dev_pagemap *pgmap, struct resource *res)
+{
+	if (pgmap && res)
+		pgmap->altmap.parent = res;
+}
+EXPORT_SYMBOL_GPL(pgmap_parent_resource);
+
 /**
  * devm_memremap_pages - remap and provide memmap backing for the given resource
  * @dev: hosting device for @res
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 1/7] mm: memremap: register/unregister altmap region to a separate resource Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 2/7] mm: memremap: add pgmap_parent_resource() helper Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-07  8:07   ` kernel test robot
  2024-03-07 11:08   ` kernel test robot
  2024-03-06 10:28 ` [PATCH v3 4/7] dax: pmem: assign a parent resource for vmemmap region for the devdax Li Zhijian
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

When the pmem is configured as fsdax, set the vmemmap region as a child
of the namespace region so that it can be registered as a separate
resource later.

CC: Dan Williams <dan.j.williams@intel.com>
CC: Vishal Verma <vishal.l.verma@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: Ira Weiny <ira.weiny@intel.com>
CC: Baoquan He <bhe@redhat.com>
CC: nvdimm@lists.linux.dev
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/nvdimm/pmem.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..b2640a3fb693 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -452,7 +452,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	struct nd_region *nd_region = to_nd_region(dev->parent);
 	int nid = dev_to_node(dev), fua;
-	struct resource *res = &nsio->res;
+	struct resource *res = &nsio->res, *parent;
 	struct range bb_range;
 	struct nd_pfn *nd_pfn = NULL;
 	struct dax_device *dax_dev;
@@ -491,12 +491,15 @@ static int pmem_attach_disk(struct device *dev,
 		fua = 0;
 	}
 
-	if (!devm_request_mem_region(dev, res->start, resource_size(res),
-				dev_name(&ndns->dev))) {
+	parent = devm_request_mem_region(dev, res->start, resource_size(res),
+				dev_name(&ndns->dev));
+	if (!res) {
 		dev_warn(dev, "could not reserve region %pR\n", res);
 		return -EBUSY;
 	}
 
+	pgmap_parent_resource(&pmem->pgmap, parent);
+
 	disk = blk_alloc_disk(nid);
 	if (!disk)
 		return -ENOMEM;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 4/7] dax: pmem: assign a parent resource for vmemmap region for the devdax
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
                   ` (2 preceding siblings ...)
  2024-03-06 10:28 ` [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 5/7] resource: Introduce walk device_backed_vmemmap res() helper Li Zhijian
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

When the pmem is configured as devdax, set the vmemmap region as a child
of the namespace region so that it can be registered as a separate
resource later.

CC: Dan Williams <dan.j.williams@intel.com>
CC: Vishal Verma <vishal.l.verma@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: Baoquan He <bhe@redhat.com>
CC: nvdimm@lists.linux.dev
CC: linux-cxl@vger.kernel.org
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/dax/pmem.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index f3c6c67b8412..6ffeb81e6c7c 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -21,6 +21,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	struct nd_dax *nd_dax = to_nd_dax(dev);
 	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
 	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct resource *parent;
 
 	ndns = nvdimm_namespace_common_probe(dev);
 	if (IS_ERR(ndns))
@@ -39,8 +40,9 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	pfn_sb = nd_pfn->pfn_sb;
 	offset = le64_to_cpu(pfn_sb->dataoff);
 	nsio = to_nd_namespace_io(&ndns->dev);
-	if (!devm_request_mem_region(dev, nsio->res.start, offset,
-				dev_name(&ndns->dev))) {
+	parent = devm_request_mem_region(dev, nsio->res.start, offset,
+				dev_name(&ndns->dev));
+	if (!parent) {
 		dev_warn(dev, "could not reserve metadata\n");
 		return ERR_PTR(-EBUSY);
 	}
@@ -66,6 +68,8 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 		.memmap_on_memory = false,
 	};
 
+	pgmap_parent_resource(&pgmap, parent);
+
 	return devm_create_dev_dax(&data);
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 5/7] resource: Introduce walk device_backed_vmemmap res() helper
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
                   ` (3 preceding siblings ...)
  2024-03-06 10:28 ` [PATCH v3 4/7] dax: pmem: assign a parent resource for vmemmap region for the devdax Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 6/7] x86/crash: make device backed vmemmap dumpable for kexec_file_load Li Zhijian
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

It walks resources registered with flags
(IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY), usually used by
device backed vmemmap region. currently, it only sticks to the
persistent memory type since it is only one user.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Baoquan He <bhe@redhat.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Alison Schofield <alison.schofield@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c      | 13 +++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3b59e924f531..10a60227d6c2 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -332,6 +332,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
 		    int (*func)(struct resource *, void *));
 extern int
+walk_device_backed_vmemmap_res(u64 start, u64 end, void *arg,
+			       int (*func)(struct resource *, void *));
+extern int
 walk_system_ram_res_rev(u64 start, u64 end, void *arg,
 			int (*func)(struct resource *, void *));
 extern int
diff --git a/kernel/resource.c b/kernel/resource.c
index fcbca39dbc45..5f484266af07 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -431,6 +431,19 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 				     func);
 }
 
+/*
+ * This function calls the @func callback against all memory ranges, which
+ * are ranges marked as (IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY)
+ * and IORES_DESC_PERSISTENT_MEMORY.
+ */
+int walk_device_backed_vmemmap_res(u64 start, u64 end, void *arg,
+			int (*func)(struct resource *, void *))
+{
+	return __walk_iomem_res_desc(start, end,
+			IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY,
+			IORES_DESC_PERSISTENT_MEMORY, arg, func);
+}
+
 /*
  * This function, being a variant of walk_system_ram_res(), calls the @func
  * callback against all memory ranges of type System RAM which are marked as
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 6/7] x86/crash: make device backed vmemmap dumpable for kexec_file_load
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
                   ` (4 preceding siblings ...)
  2024-03-06 10:28 ` [PATCH v3 5/7] resource: Introduce walk device_backed_vmemmap res() helper Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-06 10:28 ` [PATCH v3 7/7] nvdimm: set force_raw=1 in kdump kernel Li Zhijian
  2024-03-21  5:40 ` [RFC PATCH v3 0/7] device backed vmemmap crash dump support Zhijian Li (Fujitsu)
  7 siblings, 0 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

Add resources with specific flags to PT_LOADs of the elfcorehdr so that
these resources can be dumpable. This change is for kexec_file_load(2)
while kexec_load(2) setups the PT_LOADs according to its parameters
by the callers which usually rely on resources' name from /proc/iomem

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Baoquan He <bhe@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 arch/x86/kernel/crash.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index b6b044356f1b..b8426fedd2cd 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -146,6 +146,8 @@ static struct crash_mem *fill_up_crash_elf_data(void)
 	if (!nr_ranges)
 		return NULL;
 
+	walk_device_backed_vmemmap_res(0, -1, &nr_ranges,
+				       get_nr_ram_ranges_callback);
 	/*
 	 * Exclusion of crash region and/or crashk_low_res may cause
 	 * another range split. So add extra two slots here.
@@ -212,6 +214,9 @@ static int prepare_elf_headers(void **addr, unsigned long *sz,
 	if (ret)
 		goto out;
 
+	walk_device_backed_vmemmap_res(0, -1, cmem,
+				       prepare_elf64_ram_headers_callback);
+
 	/* Exclude unwanted mem ranges */
 	ret = elf_header_exclude_ranges(cmem);
 	if (ret)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 7/7] nvdimm: set force_raw=1 in kdump kernel
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
                   ` (5 preceding siblings ...)
  2024-03-06 10:28 ` [PATCH v3 6/7] x86/crash: make device backed vmemmap dumpable for kexec_file_load Li Zhijian
@ 2024-03-06 10:28 ` Li Zhijian
  2024-03-21  5:40 ` [RFC PATCH v3 0/7] device backed vmemmap crash dump support Zhijian Li (Fujitsu)
  7 siblings, 0 replies; 14+ messages in thread
From: Li Zhijian @ 2024-03-06 10:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: y-goto, Alison Schofield, Andrew Morton, Baoquan He,
	Borislav Petkov, Dan Williams, Dave Hansen, Dave Jiang,
	Greg Kroah-Hartman, hpa, Ingo Molnar, Ira Weiny, Thomas Gleixner,
	Vishal Verma, linux-cxl, linux-mm, nvdimm, x86, kexec,
	Li Zhijian

The virtually mapped memory map allows storing struct page objects for
persistent memory devices in pre-allocated storage on those devices.
These 'struct page objects' on devices are also known as metadata.

During libnvdimm/nd_pmem are loading, the previous metadata will
be re-constructed to fit the current running kernel. For kdump purpose,
these metadata should not be touched until the dumping is done so that
the metadata is identical.

To achieve this, we have some options
1. Don't provide libnvdimm driver in kdump kernel rootfs/initramfs
2. Disable libnvdimm driver by specific comline parameters:
   (initcall_blacklist=libnvdimm_init libnvdimm.blacklist=1 rd.driver.blacklist=libnvdimm)
3. Enforce force_raw=1 for nvdimm namespace, because when force_raw=1,
   metadata will not be re-constructed again. This may also result in
   the pmem doesn't work before a few extra configurations.

Here apply the 3rd option.

CC: Dan Williams <dan.j.williams@intel.com>
CC: Vishal Verma <vishal.l.verma@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: Ira Weiny <ira.weiny@intel.com>
CC: nvdimm@lists.linux.dev
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/nvdimm/namespace_devs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index d6d558f94d6b..04f855c7f0b1 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -9,6 +9,7 @@
 #include <linux/slab.h>
 #include <linux/list.h>
 #include <linux/nd.h>
+#include <linux/crash_dump.h>
 #include "nd-core.h"
 #include "pmem.h"
 #include "pfn.h"
@@ -1513,6 +1514,8 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
 			return ERR_PTR(-ENODEV);
 	}
 
+	if (is_kdump_kernel())
+		ndns->force_raw = true;
 	return ndns;
 }
 EXPORT_SYMBOL(nvdimm_namespace_common_probe);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
  2024-03-06 10:28 ` [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax Li Zhijian
@ 2024-03-07  8:07   ` kernel test robot
  2024-03-07 11:08   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-03-07  8:07 UTC (permalink / raw)
  To: Li Zhijian, linux-kernel
  Cc: llvm, oe-kbuild-all, y-goto, Alison Schofield, Andrew Morton,
	Linux Memory Management List, Baoquan He, Borislav Petkov,
	Dan Williams, Dave Hansen, Dave Jiang, Greg Kroah-Hartman, hpa,
	Ingo Molnar, Ira Weiny, Thomas Gleixner, Vishal Verma, linux-cxl,
	nvdimm, x86, kexec, Li Zhijian

Hi Li,

kernel test robot noticed the following build errors:

[auto build test ERROR on nvdimm/libnvdimm-for-next]
[also build test ERROR on tip/x86/core linus/master v6.8-rc7]
[cannot apply to akpm-mm/mm-everything nvdimm/dax-misc next-20240306]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-memremap-register-unregister-altmap-region-to-a-separate-resource/20240306-183118
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
patch link:    https://lore.kernel.org/r/20240306102846.1020868-4-lizhijian%40fujitsu.com
patch subject: [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20240307/202403071558.83GY9NwT-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 325f51237252e6dab8e4e1ea1fa7acbb4faee1cd)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240307/202403071558.83GY9NwT-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403071558.83GY9NwT-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/nvdimm/pmem.c:10:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:8:
   In file included from include/linux/cacheflush.h:5:
   In file included from arch/riscv/include/asm/cacheflush.h:9:
   In file included from include/linux/mm.h:2188:
   include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     522 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> drivers/nvdimm/pmem.c:501:2: error: call to undeclared function 'pgmap_parent_resource'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     501 |         pgmap_parent_resource(&pmem->pgmap, parent);
         |         ^
   1 warning and 1 error generated.


vim +/pgmap_parent_resource +501 drivers/nvdimm/pmem.c

   448	
   449	static int pmem_attach_disk(struct device *dev,
   450			struct nd_namespace_common *ndns)
   451	{
   452		struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
   453		struct nd_region *nd_region = to_nd_region(dev->parent);
   454		int nid = dev_to_node(dev), fua;
   455		struct resource *res = &nsio->res, *parent;
   456		struct range bb_range;
   457		struct nd_pfn *nd_pfn = NULL;
   458		struct dax_device *dax_dev;
   459		struct nd_pfn_sb *pfn_sb;
   460		struct pmem_device *pmem;
   461		struct request_queue *q;
   462		struct gendisk *disk;
   463		void *addr;
   464		int rc;
   465	
   466		pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
   467		if (!pmem)
   468			return -ENOMEM;
   469	
   470		rc = devm_namespace_enable(dev, ndns, nd_info_block_reserve());
   471		if (rc)
   472			return rc;
   473	
   474		/* while nsio_rw_bytes is active, parse a pfn info block if present */
   475		if (is_nd_pfn(dev)) {
   476			nd_pfn = to_nd_pfn(dev);
   477			rc = nvdimm_setup_pfn(nd_pfn, &pmem->pgmap);
   478			if (rc)
   479				return rc;
   480		}
   481	
   482		/* we're attaching a block device, disable raw namespace access */
   483		devm_namespace_disable(dev, ndns);
   484	
   485		dev_set_drvdata(dev, pmem);
   486		pmem->phys_addr = res->start;
   487		pmem->size = resource_size(res);
   488		fua = nvdimm_has_flush(nd_region);
   489		if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0) {
   490			dev_warn(dev, "unable to guarantee persistence of writes\n");
   491			fua = 0;
   492		}
   493	
   494		parent = devm_request_mem_region(dev, res->start, resource_size(res),
   495					dev_name(&ndns->dev));
   496		if (!res) {
   497			dev_warn(dev, "could not reserve region %pR\n", res);
   498			return -EBUSY;
   499		}
   500	
 > 501		pgmap_parent_resource(&pmem->pgmap, parent);
   502	
   503		disk = blk_alloc_disk(nid);
   504		if (!disk)
   505			return -ENOMEM;
   506		q = disk->queue;
   507	
   508		pmem->disk = disk;
   509		pmem->pgmap.owner = pmem;
   510		pmem->pfn_flags = PFN_DEV;
   511		if (is_nd_pfn(dev)) {
   512			pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
   513			pmem->pgmap.ops = &fsdax_pagemap_ops;
   514			addr = devm_memremap_pages(dev, &pmem->pgmap);
   515			pfn_sb = nd_pfn->pfn_sb;
   516			pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
   517			pmem->pfn_pad = resource_size(res) -
   518				range_len(&pmem->pgmap.range);
   519			pmem->pfn_flags |= PFN_MAP;
   520			bb_range = pmem->pgmap.range;
   521			bb_range.start += pmem->data_offset;
   522		} else if (pmem_should_map_pages(dev)) {
   523			pmem->pgmap.range.start = res->start;
   524			pmem->pgmap.range.end = res->end;
   525			pmem->pgmap.nr_range = 1;
   526			pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
   527			pmem->pgmap.ops = &fsdax_pagemap_ops;
   528			addr = devm_memremap_pages(dev, &pmem->pgmap);
   529			pmem->pfn_flags |= PFN_MAP;
   530			bb_range = pmem->pgmap.range;
   531		} else {
   532			addr = devm_memremap(dev, pmem->phys_addr,
   533					pmem->size, ARCH_MEMREMAP_PMEM);
   534			bb_range.start =  res->start;
   535			bb_range.end = res->end;
   536		}
   537	
   538		if (IS_ERR(addr)) {
   539			rc = PTR_ERR(addr);
   540			goto out;
   541		}
   542		pmem->virt_addr = addr;
   543	
   544		blk_queue_write_cache(q, true, fua);
   545		blk_queue_physical_block_size(q, PAGE_SIZE);
   546		blk_queue_logical_block_size(q, pmem_sector_size(ndns));
   547		blk_queue_max_hw_sectors(q, UINT_MAX);
   548		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
   549		blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
   550		if (pmem->pfn_flags & PFN_MAP)
   551			blk_queue_flag_set(QUEUE_FLAG_DAX, q);
   552	
   553		disk->fops		= &pmem_fops;
   554		disk->private_data	= pmem;
   555		nvdimm_namespace_disk_name(ndns, disk->disk_name);
   556		set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
   557				/ 512);
   558		if (devm_init_badblocks(dev, &pmem->bb))
   559			return -ENOMEM;
   560		nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
   561		disk->bb = &pmem->bb;
   562	
   563		dax_dev = alloc_dax(pmem, &pmem_dax_ops);
   564		if (IS_ERR(dax_dev)) {
   565			rc = PTR_ERR(dax_dev);
   566			goto out;
   567		}
   568		set_dax_nocache(dax_dev);
   569		set_dax_nomc(dax_dev);
   570		if (is_nvdimm_sync(nd_region))
   571			set_dax_synchronous(dax_dev);
   572		rc = dax_add_host(dax_dev, disk);
   573		if (rc)
   574			goto out_cleanup_dax;
   575		dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
   576		pmem->dax_dev = dax_dev;
   577	
   578		rc = device_add_disk(dev, disk, pmem_attribute_groups);
   579		if (rc)
   580			goto out_remove_host;
   581		if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
   582			return -ENOMEM;
   583	
   584		nvdimm_check_and_set_ro(disk);
   585	
   586		pmem->bb_state = sysfs_get_dirent(disk_to_dev(disk)->kobj.sd,
   587						  "badblocks");
   588		if (!pmem->bb_state)
   589			dev_warn(dev, "'badblocks' notification disabled\n");
   590		return 0;
   591	
   592	out_remove_host:
   593		dax_remove_host(pmem->disk);
   594	out_cleanup_dax:
   595		kill_dax(pmem->dax_dev);
   596		put_dax(pmem->dax_dev);
   597	out:
   598		put_disk(pmem->disk);
   599		return rc;
   600	}
   601	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
  2024-03-06 10:28 ` [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax Li Zhijian
  2024-03-07  8:07   ` kernel test robot
@ 2024-03-07 11:08   ` kernel test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-03-07 11:08 UTC (permalink / raw)
  To: Li Zhijian, linux-kernel
  Cc: oe-kbuild-all, y-goto, Alison Schofield, Andrew Morton,
	Linux Memory Management List, Baoquan He, Borislav Petkov,
	Dan Williams, Dave Hansen, Dave Jiang, Greg Kroah-Hartman, hpa,
	Ingo Molnar, Ira Weiny, Thomas Gleixner, Vishal Verma, linux-cxl,
	nvdimm, x86, kexec, Li Zhijian

Hi Li,

kernel test robot noticed the following build errors:

[auto build test ERROR on nvdimm/libnvdimm-for-next]
[also build test ERROR on tip/x86/core linus/master v6.8-rc7]
[cannot apply to akpm-mm/mm-everything nvdimm/dax-misc next-20240306]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-memremap-register-unregister-altmap-region-to-a-separate-resource/20240306-183118
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
patch link:    https://lore.kernel.org/r/20240306102846.1020868-4-lizhijian%40fujitsu.com
patch subject: [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
config: loongarch-allmodconfig (https://download.01.org/0day-ci/archive/20240307/202403071804.b9EgMxWo-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240307/202403071804.b9EgMxWo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403071804.b9EgMxWo-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/nvdimm/pmem.c: In function 'pmem_attach_disk':
>> drivers/nvdimm/pmem.c:501:9: error: implicit declaration of function 'pgmap_parent_resource' [-Werror=implicit-function-declaration]
     501 |         pgmap_parent_resource(&pmem->pgmap, parent);
         |         ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/pgmap_parent_resource +501 drivers/nvdimm/pmem.c

   448	
   449	static int pmem_attach_disk(struct device *dev,
   450			struct nd_namespace_common *ndns)
   451	{
   452		struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
   453		struct nd_region *nd_region = to_nd_region(dev->parent);
   454		int nid = dev_to_node(dev), fua;
   455		struct resource *res = &nsio->res, *parent;
   456		struct range bb_range;
   457		struct nd_pfn *nd_pfn = NULL;
   458		struct dax_device *dax_dev;
   459		struct nd_pfn_sb *pfn_sb;
   460		struct pmem_device *pmem;
   461		struct request_queue *q;
   462		struct gendisk *disk;
   463		void *addr;
   464		int rc;
   465	
   466		pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
   467		if (!pmem)
   468			return -ENOMEM;
   469	
   470		rc = devm_namespace_enable(dev, ndns, nd_info_block_reserve());
   471		if (rc)
   472			return rc;
   473	
   474		/* while nsio_rw_bytes is active, parse a pfn info block if present */
   475		if (is_nd_pfn(dev)) {
   476			nd_pfn = to_nd_pfn(dev);
   477			rc = nvdimm_setup_pfn(nd_pfn, &pmem->pgmap);
   478			if (rc)
   479				return rc;
   480		}
   481	
   482		/* we're attaching a block device, disable raw namespace access */
   483		devm_namespace_disable(dev, ndns);
   484	
   485		dev_set_drvdata(dev, pmem);
   486		pmem->phys_addr = res->start;
   487		pmem->size = resource_size(res);
   488		fua = nvdimm_has_flush(nd_region);
   489		if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0) {
   490			dev_warn(dev, "unable to guarantee persistence of writes\n");
   491			fua = 0;
   492		}
   493	
   494		parent = devm_request_mem_region(dev, res->start, resource_size(res),
   495					dev_name(&ndns->dev));
   496		if (!res) {
   497			dev_warn(dev, "could not reserve region %pR\n", res);
   498			return -EBUSY;
   499		}
   500	
 > 501		pgmap_parent_resource(&pmem->pgmap, parent);
   502	
   503		disk = blk_alloc_disk(nid);
   504		if (!disk)
   505			return -ENOMEM;
   506		q = disk->queue;
   507	
   508		pmem->disk = disk;
   509		pmem->pgmap.owner = pmem;
   510		pmem->pfn_flags = PFN_DEV;
   511		if (is_nd_pfn(dev)) {
   512			pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
   513			pmem->pgmap.ops = &fsdax_pagemap_ops;
   514			addr = devm_memremap_pages(dev, &pmem->pgmap);
   515			pfn_sb = nd_pfn->pfn_sb;
   516			pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
   517			pmem->pfn_pad = resource_size(res) -
   518				range_len(&pmem->pgmap.range);
   519			pmem->pfn_flags |= PFN_MAP;
   520			bb_range = pmem->pgmap.range;
   521			bb_range.start += pmem->data_offset;
   522		} else if (pmem_should_map_pages(dev)) {
   523			pmem->pgmap.range.start = res->start;
   524			pmem->pgmap.range.end = res->end;
   525			pmem->pgmap.nr_range = 1;
   526			pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
   527			pmem->pgmap.ops = &fsdax_pagemap_ops;
   528			addr = devm_memremap_pages(dev, &pmem->pgmap);
   529			pmem->pfn_flags |= PFN_MAP;
   530			bb_range = pmem->pgmap.range;
   531		} else {
   532			addr = devm_memremap(dev, pmem->phys_addr,
   533					pmem->size, ARCH_MEMREMAP_PMEM);
   534			bb_range.start =  res->start;
   535			bb_range.end = res->end;
   536		}
   537	
   538		if (IS_ERR(addr)) {
   539			rc = PTR_ERR(addr);
   540			goto out;
   541		}
   542		pmem->virt_addr = addr;
   543	
   544		blk_queue_write_cache(q, true, fua);
   545		blk_queue_physical_block_size(q, PAGE_SIZE);
   546		blk_queue_logical_block_size(q, pmem_sector_size(ndns));
   547		blk_queue_max_hw_sectors(q, UINT_MAX);
   548		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
   549		blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
   550		if (pmem->pfn_flags & PFN_MAP)
   551			blk_queue_flag_set(QUEUE_FLAG_DAX, q);
   552	
   553		disk->fops		= &pmem_fops;
   554		disk->private_data	= pmem;
   555		nvdimm_namespace_disk_name(ndns, disk->disk_name);
   556		set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
   557				/ 512);
   558		if (devm_init_badblocks(dev, &pmem->bb))
   559			return -ENOMEM;
   560		nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
   561		disk->bb = &pmem->bb;
   562	
   563		dax_dev = alloc_dax(pmem, &pmem_dax_ops);
   564		if (IS_ERR(dax_dev)) {
   565			rc = PTR_ERR(dax_dev);
   566			goto out;
   567		}
   568		set_dax_nocache(dax_dev);
   569		set_dax_nomc(dax_dev);
   570		if (is_nvdimm_sync(nd_region))
   571			set_dax_synchronous(dax_dev);
   572		rc = dax_add_host(dax_dev, disk);
   573		if (rc)
   574			goto out_cleanup_dax;
   575		dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
   576		pmem->dax_dev = dax_dev;
   577	
   578		rc = device_add_disk(dev, disk, pmem_attribute_groups);
   579		if (rc)
   580			goto out_remove_host;
   581		if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
   582			return -ENOMEM;
   583	
   584		nvdimm_check_and_set_ro(disk);
   585	
   586		pmem->bb_state = sysfs_get_dirent(disk_to_dev(disk)->kobj.sd,
   587						  "badblocks");
   588		if (!pmem->bb_state)
   589			dev_warn(dev, "'badblocks' notification disabled\n");
   590		return 0;
   591	
   592	out_remove_host:
   593		dax_remove_host(pmem->disk);
   594	out_cleanup_dax:
   595		kill_dax(pmem->dax_dev);
   596		put_dax(pmem->dax_dev);
   597	out:
   598		put_disk(pmem->disk);
   599		return rc;
   600	}
   601	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v3 0/7] device backed vmemmap crash dump support
  2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
                   ` (6 preceding siblings ...)
  2024-03-06 10:28 ` [PATCH v3 7/7] nvdimm: set force_raw=1 in kdump kernel Li Zhijian
@ 2024-03-21  5:40 ` Zhijian Li (Fujitsu)
  2024-03-21  6:17   ` Baoquan He
  2024-04-19  2:05   ` Zhijian Li (Fujitsu)
  7 siblings, 2 replies; 14+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-03-21  5:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yasunori Gotou (Fujitsu),
	Alison Schofield, Andrew Morton, Baoquan He, Borislav Petkov,
	Dan Williams, Dave Hansen, Dave Jiang, Greg Kroah-Hartman, hpa,
	Ingo Molnar, Ira Weiny, Thomas Gleixner, Vishal Verma, linux-cxl,
	linux-mm, nvdimm, x86, kexec

ping


Any comment is welcome.


On 06/03/2024 18:28, Li Zhijian wrote:
> Hello folks,
> 
> Compared with the V2[1] I posted a long time ago, this time it is a
> completely new proposal design.
> 
> ### Background and motivate overview ###
> ---
> Crash dump is an important feature for troubleshooting the kernel. It is the
> final way to chase what happened at the kernel panic, slowdown, and so on. It
> is one of the most important tools for customer support.
> 
> Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
> configure the dumpable regions. Generally, (A)iomem resources registered with
> flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
> (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
> are dumpable.
> 
> The pmem use cases including fsdax and devdax, could map their vmemmap to
> their own devices. In this case, these part of vmemmap will not be dumped when
> crash happened since these regions are satisfied with neither the above (A)
> nor (B).
> 
> In fsdax, the vmemmap(struct page array) becomes very important, it is one of
> the key data to find status of reverse map. Lacking of the information may
> cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
> That means troubleshooters are unable to check more details about pmem from
> the dumpfile.
> 
> ### Proposal ###
> ---
> In this proposal, register the device backed vmemmap as a separate resource.
> This resource has its own new flag and name, and then teaches kexec_file_load(2)
> and kexec_load(2) to mark it as dumpable.
> 
> Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
> Proposed name: "Device Backed Vmemmap"
> 
> NOTE: crash-utils also needs to adapt to this new name for kexec_load()
> 
> With current proposal, the /proc/iomem should show as following for device
> backed vmemmap
> # cat /proc/iomem
> ...
> fffc0000-ffffffff : Reserved
> 100000000-13fffffff : Persistent Memory
>    100000000-10fffffff : namespace0.0
>      100000000-1005fffff : Device Backed Vmemmap  # fsdax
> a80000000-b7fffffff : CXL Window 0
>    a80000000-affffffff : Persistent Memory
>      a80000000-affffffff : region1
>        a80000000-a811fffff : namespace1.0
>          a80000000-a811fffff : Device Backed Vmemmap # devdax
>        a81200000-abfffffff : dax1.0
> b80000000-c7fffffff : CXL Window 1
> c80000000-147fffffff : PCI Bus 0000:00
>    c80000000-c801fffff : PCI Bus 0000:01
> ...
> 
> ### Kdump service reloading ###
> ---
> Once the kdump service is loaded, if changes to CPUs or memory occur,
> either by hot un/plug or off/onlining, the crash elfcorehdr should also
> be updated. There are 2 approaches to make the reloading more efficient.
> 1) Use udev rules to watch CPU and memory events, then reload kdump
> 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
> 
> This reloading also needed when device backed vmemmap layouts change, Similar
> to what 1) does now, one could add the following as the first lines to the
> RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
> 
> # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> # devdax <-> system-ram updated: watch daxX.Y of dax
> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> 
> Regarding 2), my idea is that it would need to call the memory_notify() in
> devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
> hotplug. This part is not yet mature, but it does not affect the whole feature
> because we can still use method 1) alternatively.
> 
> [1] https://lore.kernel.org/lkml/02066f0f-dbc0-0388-4233-8e24b6f8435b@fujitsu.com/T/
> --------------------------------------------
> changes from V2[1]
> - new proposal design
> 
> CC: Alison Schofield <alison.schofield@intel.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Baoquan He <bhe@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: Dan Williams <dan.j.williams@intel.com>
> CC: Dave Hansen <dave.hansen@linux.intel.com>
> CC: Dave Jiang <dave.jiang@intel.com>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Ira Weiny <ira.weiny@intel.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Vishal Verma <vishal.l.verma@intel.com>
> CC: linux-cxl@vger.kernel.org
> CC: linux-mm@kvack.org
> CC: nvdimm@lists.linux.dev
> CC: x86@kernel.org
> 
> Li Zhijian (7):
>    mm: memremap: register/unregister altmap region to a separate resource
>    mm: memremap: add pgmap_parent_resource() helper
>    nvdimm: pmem: assign a parent resource for vmemmap region for the
>      fsdax
>    dax: pmem: assign a parent resource for vmemmap region for the devdax
>    resource: Introduce walk device_backed_vmemmap res() helper
>    x86/crash: make device backed vmemmap dumpable for kexec_file_load
>    nvdimm: set force_raw=1 in kdump kernel
> 
>   arch/x86/kernel/crash.c         |  5 +++++
>   drivers/dax/pmem.c              |  8 ++++++--
>   drivers/nvdimm/namespace_devs.c |  3 +++
>   drivers/nvdimm/pmem.c           |  9 ++++++---
>   include/linux/ioport.h          |  4 ++++
>   include/linux/memremap.h        |  4 ++++
>   kernel/resource.c               | 13 +++++++++++++
>   mm/memremap.c                   | 30 +++++++++++++++++++++++++++++-
>   8 files changed, 70 insertions(+), 6 deletions(-)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v3 0/7] device backed vmemmap crash dump support
  2024-03-21  5:40 ` [RFC PATCH v3 0/7] device backed vmemmap crash dump support Zhijian Li (Fujitsu)
@ 2024-03-21  6:17   ` Baoquan He
  2024-03-21  6:57     ` Zhijian Li (Fujitsu)
  2024-04-19  2:05   ` Zhijian Li (Fujitsu)
  1 sibling, 1 reply; 14+ messages in thread
From: Baoquan He @ 2024-03-21  6:17 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: linux-kernel, Yasunori Gotou (Fujitsu),
	Alison Schofield, Andrew Morton, Borislav Petkov, Dan Williams,
	Dave Hansen, Dave Jiang, Greg Kroah-Hartman, hpa, Ingo Molnar,
	Ira Weiny, Thomas Gleixner, Vishal Verma, linux-cxl, linux-mm,
	nvdimm, x86, kexec

On 03/21/24 at 05:40am, Zhijian Li (Fujitsu) wrote:
> ping
> 
> 
> Any comment is welcome.

I will have a look at this from kdump side. How do you test your code?

By the way, there's issue reported by test robot.

Thanks
Baoquan

> 
> 
> On 06/03/2024 18:28, Li Zhijian wrote:
> > Hello folks,
> > 
> > Compared with the V2[1] I posted a long time ago, this time it is a
> > completely new proposal design.
> > 
> > ### Background and motivate overview ###
> > ---
> > Crash dump is an important feature for troubleshooting the kernel. It is the
> > final way to chase what happened at the kernel panic, slowdown, and so on. It
> > is one of the most important tools for customer support.
> > 
> > Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
> > configure the dumpable regions. Generally, (A)iomem resources registered with
> > flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
> > (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
> > are dumpable.
> > 
> > The pmem use cases including fsdax and devdax, could map their vmemmap to
> > their own devices. In this case, these part of vmemmap will not be dumped when
> > crash happened since these regions are satisfied with neither the above (A)
> > nor (B).
> > 
> > In fsdax, the vmemmap(struct page array) becomes very important, it is one of
> > the key data to find status of reverse map. Lacking of the information may
> > cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
> > That means troubleshooters are unable to check more details about pmem from
> > the dumpfile.
> > 
> > ### Proposal ###
> > ---
> > In this proposal, register the device backed vmemmap as a separate resource.
> > This resource has its own new flag and name, and then teaches kexec_file_load(2)
> > and kexec_load(2) to mark it as dumpable.
> > 
> > Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
> > Proposed name: "Device Backed Vmemmap"
> > 
> > NOTE: crash-utils also needs to adapt to this new name for kexec_load()
> > 
> > With current proposal, the /proc/iomem should show as following for device
> > backed vmemmap
> > # cat /proc/iomem
> > ...
> > fffc0000-ffffffff : Reserved
> > 100000000-13fffffff : Persistent Memory
> >    100000000-10fffffff : namespace0.0
> >      100000000-1005fffff : Device Backed Vmemmap  # fsdax
> > a80000000-b7fffffff : CXL Window 0
> >    a80000000-affffffff : Persistent Memory
> >      a80000000-affffffff : region1
> >        a80000000-a811fffff : namespace1.0
> >          a80000000-a811fffff : Device Backed Vmemmap # devdax
> >        a81200000-abfffffff : dax1.0
> > b80000000-c7fffffff : CXL Window 1
> > c80000000-147fffffff : PCI Bus 0000:00
> >    c80000000-c801fffff : PCI Bus 0000:01
> > ...
> > 
> > ### Kdump service reloading ###
> > ---
> > Once the kdump service is loaded, if changes to CPUs or memory occur,
> > either by hot un/plug or off/onlining, the crash elfcorehdr should also
> > be updated. There are 2 approaches to make the reloading more efficient.
> > 1) Use udev rules to watch CPU and memory events, then reload kdump
> > 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
> > 
> > This reloading also needed when device backed vmemmap layouts change, Similar
> > to what 1) does now, one could add the following as the first lines to the
> > RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
> > 
> > # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
> > SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
> > SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> > # devdax <-> system-ram updated: watch daxX.Y of dax
> > SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
> > SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> > 
> > Regarding 2), my idea is that it would need to call the memory_notify() in
> > devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
> > hotplug. This part is not yet mature, but it does not affect the whole feature
> > because we can still use method 1) alternatively.
> > 
> > [1] https://lore.kernel.org/lkml/02066f0f-dbc0-0388-4233-8e24b6f8435b@fujitsu.com/T/
> > --------------------------------------------
> > changes from V2[1]
> > - new proposal design
> > 
> > CC: Alison Schofield <alison.schofield@intel.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > CC: Baoquan He <bhe@redhat.com>
> > CC: Borislav Petkov <bp@alien8.de>
> > CC: Dan Williams <dan.j.williams@intel.com>
> > CC: Dave Hansen <dave.hansen@linux.intel.com>
> > CC: Dave Jiang <dave.jiang@intel.com>
> > CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > CC: "H. Peter Anvin" <hpa@zytor.com>
> > CC: Ingo Molnar <mingo@redhat.com>
> > CC: Ira Weiny <ira.weiny@intel.com>
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Vishal Verma <vishal.l.verma@intel.com>
> > CC: linux-cxl@vger.kernel.org
> > CC: linux-mm@kvack.org
> > CC: nvdimm@lists.linux.dev
> > CC: x86@kernel.org
> > 
> > Li Zhijian (7):
> >    mm: memremap: register/unregister altmap region to a separate resource
> >    mm: memremap: add pgmap_parent_resource() helper
> >    nvdimm: pmem: assign a parent resource for vmemmap region for the
> >      fsdax
> >    dax: pmem: assign a parent resource for vmemmap region for the devdax
> >    resource: Introduce walk device_backed_vmemmap res() helper
> >    x86/crash: make device backed vmemmap dumpable for kexec_file_load
> >    nvdimm: set force_raw=1 in kdump kernel
> > 
> >   arch/x86/kernel/crash.c         |  5 +++++
> >   drivers/dax/pmem.c              |  8 ++++++--
> >   drivers/nvdimm/namespace_devs.c |  3 +++
> >   drivers/nvdimm/pmem.c           |  9 ++++++---
> >   include/linux/ioport.h          |  4 ++++
> >   include/linux/memremap.h        |  4 ++++
> >   kernel/resource.c               | 13 +++++++++++++
> >   mm/memremap.c                   | 30 +++++++++++++++++++++++++++++-
> >   8 files changed, 70 insertions(+), 6 deletions(-)
> > 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v3 0/7] device backed vmemmap crash dump support
  2024-03-21  6:17   ` Baoquan He
@ 2024-03-21  6:57     ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 14+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-03-21  6:57 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-kernel, Yasunori Gotou (Fujitsu),
	Alison Schofield, Andrew Morton, Borislav Petkov, Dan Williams,
	Dave Hansen, Dave Jiang, Greg Kroah-Hartman, hpa, Ingo Molnar,
	Ira Weiny, Thomas Gleixner, Vishal Verma, linux-cxl, linux-mm,
	nvdimm, x86, kexec

Baoquan,



On 21/03/2024 14:17, Baoquan He wrote:
> On 03/21/24 at 05:40am, Zhijian Li (Fujitsu) wrote:
>> ping
>>
>>
>> Any comment is welcome.
> 
> I will have a look at this from kdump side. How do you test your code?

Thanks for your support.

- nothing change is required for makedumpfile and crash-utils
- nothing change is required for kexe-utils if you are using kexec_file_load(2)
- kexec-tool needs apply below patch if you are using kexec_load(2),


After the coredump is collected, crash-utils is able to inspect the device backed vmemmap memory.


 From 4a2f0f91cdd8b084bb4ebafdc4f71b8e2a333720 Mon Sep 17 00:00:00 2001
From: Li Zhijian <lizhijian@fujitsu.com>
Date: Fri, 1 Mar 2024 13:44:36 +0800
Subject: [PATCH] consider device_backed_vmemmap dumpable
                                                          
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
  kexec/arch/i386/crashdump-x86.c    | 2 ++
  kexec/arch/i386/kexec-x86-common.c | 3 +++
  kexec/crashdump-elf.c              | 2 +-
  kexec/kexec.h                      | 1 +
  4 files changed, 7 insertions(+), 1 deletion(-)
                                                          
diff --git a/kexec/arch/i386/crashdump-x86.c b/kexec/arch/i386/crashdump-x86.c
index a01031e570aa..757922dc3a57 100644
--- a/kexec/arch/i386/crashdump-x86.c
+++ b/kexec/arch/i386/crashdump-x86.c
@@ -284,6 +284,8 @@ static int get_crash_memory_ranges(struct memory_range **range, int *ranges,
                         type = RANGE_PRAM;
                 } else if(memcmp(str,"Persistent Memory\n",18) == 0 ) {
                         type = RANGE_PMEM;
+               } else if (memcmp(str, "Device Backed Vmemmap\n", 22)) {
+                       type = RANGE_DEVICE_BACKED_VMEMMAP;
                 } else if(memcmp(str,"reserved\n",9) == 0 ) {
                         type = RANGE_RESERVED;
                 } else if (memcmp(str, "Reserved\n", 9) == 0) {
diff --git a/kexec/arch/i386/kexec-x86-common.c b/kexec/arch/i386/kexec-x86-common.c
index ffc95a9e43f8..0dd76903e7fc 100644
--- a/kexec/arch/i386/kexec-x86-common.c
+++ b/kexec/arch/i386/kexec-x86-common.c
@@ -111,6 +111,9 @@ static int get_memory_ranges_proc_iomem(struct memory_range **range, int *ranges
                 else if (memcmp(str, "Persistent Memory\n", 18) == 0) {
                         type = RANGE_PMEM;
                 }
+               else if (memcmp(str, "Device Backed Vmemmap\n", 22)) {
+                       type = RANGE_DEVICE_BACKED_VMEMMAP;
+               }
                 else {
                         continue;
                 }
diff --git a/kexec/crashdump-elf.c b/kexec/crashdump-elf.c
index b8bb686a17ca..98ad854e3f8d 100644
--- a/kexec/crashdump-elf.c
+++ b/kexec/crashdump-elf.c
@@ -199,7 +199,7 @@ int FUNC(struct kexec_info *info,
          * A seprate program header for Backup Region*/
         for (i = 0; i < ranges; i++, range++) {
                 unsigned long long mstart, mend;
-               if (range->type != RANGE_RAM)
+               if (range->type != RANGE_RAM || range->type != RANGE_DEVICE_BACKED_VMEMMAP)
                         continue;
                 mstart = range->start;
                 mend = range->end;
diff --git a/kexec/kexec.h b/kexec/kexec.h
index 1004aff15945..c0481bb2727d 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -139,6 +139,7 @@ struct memory_range {
  #define RANGE_UNCACHED 4
  #define RANGE_PMEM             6
  #define RANGE_PRAM             11
+#define RANGE_DEVICE_BACKED_VMEMMAP 12
  };
   
  struct memory_ranges {
-- 
2.29.2



> 
> By the way, there's issue reported by test robot.


I see, i have fixed it locally which doesn't matter for this feature(it fails in !ZONE_DEVICE)


Thanks
Zhijian

> 
> Thanks
> Baoquan
> 
>>
>>
>> On 06/03/2024 18:28, Li Zhijian wrote:
>>> Hello folks,
>>>
>>> Compared with the V2[1] I posted a long time ago, this time it is a
>>> completely new proposal design.
>>>
>>> ### Background and motivate overview ###
>>> ---
>>> Crash dump is an important feature for troubleshooting the kernel. It is the
>>> final way to chase what happened at the kernel panic, slowdown, and so on. It
>>> is one of the most important tools for customer support.
>>>
>>> Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
>>> configure the dumpable regions. Generally, (A)iomem resources registered with
>>> flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
>>> (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
>>> are dumpable.
>>>
>>> The pmem use cases including fsdax and devdax, could map their vmemmap to
>>> their own devices. In this case, these part of vmemmap will not be dumped when
>>> crash happened since these regions are satisfied with neither the above (A)
>>> nor (B).
>>>
>>> In fsdax, the vmemmap(struct page array) becomes very important, it is one of
>>> the key data to find status of reverse map. Lacking of the information may
>>> cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
>>> That means troubleshooters are unable to check more details about pmem from
>>> the dumpfile.
>>>
>>> ### Proposal ###
>>> ---
>>> In this proposal, register the device backed vmemmap as a separate resource.
>>> This resource has its own new flag and name, and then teaches kexec_file_load(2)
>>> and kexec_load(2) to mark it as dumpable.
>>>
>>> Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
>>> Proposed name: "Device Backed Vmemmap"
>>>
>>> NOTE: crash-utils also needs to adapt to this new name for kexec_load()
>>>
>>> With current proposal, the /proc/iomem should show as following for device
>>> backed vmemmap
>>> # cat /proc/iomem
>>> ...
>>> fffc0000-ffffffff : Reserved
>>> 100000000-13fffffff : Persistent Memory
>>>     100000000-10fffffff : namespace0.0
>>>       100000000-1005fffff : Device Backed Vmemmap  # fsdax
>>> a80000000-b7fffffff : CXL Window 0
>>>     a80000000-affffffff : Persistent Memory
>>>       a80000000-affffffff : region1
>>>         a80000000-a811fffff : namespace1.0
>>>           a80000000-a811fffff : Device Backed Vmemmap # devdax
>>>         a81200000-abfffffff : dax1.0
>>> b80000000-c7fffffff : CXL Window 1
>>> c80000000-147fffffff : PCI Bus 0000:00
>>>     c80000000-c801fffff : PCI Bus 0000:01
>>> ...
>>>
>>> ### Kdump service reloading ###
>>> ---
>>> Once the kdump service is loaded, if changes to CPUs or memory occur,
>>> either by hot un/plug or off/onlining, the crash elfcorehdr should also
>>> be updated. There are 2 approaches to make the reloading more efficient.
>>> 1) Use udev rules to watch CPU and memory events, then reload kdump
>>> 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
>>>
>>> This reloading also needed when device backed vmemmap layouts change, Similar
>>> to what 1) does now, one could add the following as the first lines to the
>>> RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
>>>
>>> # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
>>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
>>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>>> # devdax <-> system-ram updated: watch daxX.Y of dax
>>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
>>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>>>
>>> Regarding 2), my idea is that it would need to call the memory_notify() in
>>> devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
>>> hotplug. This part is not yet mature, but it does not affect the whole feature
>>> because we can still use method 1) alternatively.
>>>
>>> [1] https://lore.kernel.org/lkml/02066f0f-dbc0-0388-4233-8e24b6f8435b@fujitsu.com/T/
>>> --------------------------------------------
>>> changes from V2[1]
>>> - new proposal design
>>>
>>> CC: Alison Schofield <alison.schofield@intel.com>
>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>> CC: Baoquan He <bhe@redhat.com>
>>> CC: Borislav Petkov <bp@alien8.de>
>>> CC: Dan Williams <dan.j.williams@intel.com>
>>> CC: Dave Hansen <dave.hansen@linux.intel.com>
>>> CC: Dave Jiang <dave.jiang@intel.com>
>>> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> CC: "H. Peter Anvin" <hpa@zytor.com>
>>> CC: Ingo Molnar <mingo@redhat.com>
>>> CC: Ira Weiny <ira.weiny@intel.com>
>>> CC: Thomas Gleixner <tglx@linutronix.de>
>>> CC: Vishal Verma <vishal.l.verma@intel.com>
>>> CC: linux-cxl@vger.kernel.org
>>> CC: linux-mm@kvack.org
>>> CC: nvdimm@lists.linux.dev
>>> CC: x86@kernel.org
>>>
>>> Li Zhijian (7):
>>>     mm: memremap: register/unregister altmap region to a separate resource
>>>     mm: memremap: add pgmap_parent_resource() helper
>>>     nvdimm: pmem: assign a parent resource for vmemmap region for the
>>>       fsdax
>>>     dax: pmem: assign a parent resource for vmemmap region for the devdax
>>>     resource: Introduce walk device_backed_vmemmap res() helper
>>>     x86/crash: make device backed vmemmap dumpable for kexec_file_load
>>>     nvdimm: set force_raw=1 in kdump kernel
>>>
>>>    arch/x86/kernel/crash.c         |  5 +++++
>>>    drivers/dax/pmem.c              |  8 ++++++--
>>>    drivers/nvdimm/namespace_devs.c |  3 +++
>>>    drivers/nvdimm/pmem.c           |  9 ++++++---
>>>    include/linux/ioport.h          |  4 ++++
>>>    include/linux/memremap.h        |  4 ++++
>>>    kernel/resource.c               | 13 +++++++++++++
>>>    mm/memremap.c                   | 30 +++++++++++++++++++++++++++++-
>>>    8 files changed, 70 insertions(+), 6 deletions(-)
>>>
> 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v3 0/7] device backed vmemmap crash dump support
  2024-03-21  5:40 ` [RFC PATCH v3 0/7] device backed vmemmap crash dump support Zhijian Li (Fujitsu)
  2024-03-21  6:17   ` Baoquan He
@ 2024-04-19  2:05   ` Zhijian Li (Fujitsu)
  1 sibling, 0 replies; 14+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-04-19  2:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yasunori Gotou (Fujitsu),
	Alison Schofield, Andrew Morton, Baoquan He, Borislav Petkov,
	Dan Williams, Dave Hansen, Dave Jiang, Greg Kroah-Hartman, hpa,
	Ingo Molnar, Ira Weiny, Thomas Gleixner, Vishal Verma, linux-cxl,
	linux-mm, nvdimm, x86, kexec


ping again,

I would appreciate feedback from the nvdimm and linux-mm communities

Thanks
Zhijian


On 21/03/2024 13:40, Li Zhijian wrote:
> ping
> 
> 
> Any comment is welcome.
> 
> 
> On 06/03/2024 18:28, Li Zhijian wrote:
>> Hello folks,
>>
>> Compared with the V2[1] I posted a long time ago, this time it is a
>> completely new proposal design.
>>
>> ### Background and motivate overview ###
>> ---
>> Crash dump is an important feature for troubleshooting the kernel. It is the
>> final way to chase what happened at the kernel panic, slowdown, and so on. It
>> is one of the most important tools for customer support.
>>
>> Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
>> configure the dumpable regions. Generally, (A)iomem resources registered with
>> flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
>> (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
>> are dumpable.
>>
>> The pmem use cases including fsdax and devdax, could map their vmemmap to
>> their own devices. In this case, these part of vmemmap will not be dumped when
>> crash happened since these regions are satisfied with neither the above (A)
>> nor (B).
>>
>> In fsdax, the vmemmap(struct page array) becomes very important, it is one of
>> the key data to find status of reverse map. Lacking of the information may
>> cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
>> That means troubleshooters are unable to check more details about pmem from
>> the dumpfile.
>>
>> ### Proposal ###
>> ---
>> In this proposal, register the device backed vmemmap as a separate resource.
>> This resource has its own new flag and name, and then teaches kexec_file_load(2)
>> and kexec_load(2) to mark it as dumpable.
>>
>> Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
>> Proposed name: "Device Backed Vmemmap"
>>
>> NOTE: crash-utils also needs to adapt to this new name for kexec_load()
>>
>> With current proposal, the /proc/iomem should show as following for device
>> backed vmemmap
>> # cat /proc/iomem
>> ...
>> fffc0000-ffffffff : Reserved
>> 100000000-13fffffff : Persistent Memory
>>    100000000-10fffffff : namespace0.0
>>      100000000-1005fffff : Device Backed Vmemmap  # fsdax
>> a80000000-b7fffffff : CXL Window 0
>>    a80000000-affffffff : Persistent Memory
>>      a80000000-affffffff : region1
>>        a80000000-a811fffff : namespace1.0
>>          a80000000-a811fffff : Device Backed Vmemmap # devdax
>>        a81200000-abfffffff : dax1.0
>> b80000000-c7fffffff : CXL Window 1
>> c80000000-147fffffff : PCI Bus 0000:00
>>    c80000000-c801fffff : PCI Bus 0000:01
>> ...
>>
>> ### Kdump service reloading ###
>> ---
>> Once the kdump service is loaded, if changes to CPUs or memory occur,
>> either by hot un/plug or off/onlining, the crash elfcorehdr should also
>> be updated. There are 2 approaches to make the reloading more efficient.
>> 1) Use udev rules to watch CPU and memory events, then reload kdump
>> 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
>>
>> This reloading also needed when device backed vmemmap layouts change, Similar
>> to what 1) does now, one could add the following as the first lines to the
>> RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
>>
>> # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>> # devdax <-> system-ram updated: watch daxX.Y of dax
>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>>
>> Regarding 2), my idea is that it would need to call the memory_notify() in
>> devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
>> hotplug. This part is not yet mature, but it does not affect the whole feature
>> because we can still use method 1) alternatively.
>>
>> [1] https://lore.kernel.org/lkml/02066f0f-dbc0-0388-4233-8e24b6f8435b@fujitsu.com/T/
>> --------------------------------------------
>> changes from V2[1]
>> - new proposal design
>>
>> CC: Alison Schofield <alison.schofield@intel.com>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> CC: Baoquan He <bhe@redhat.com>
>> CC: Borislav Petkov <bp@alien8.de>
>> CC: Dan Williams <dan.j.williams@intel.com>
>> CC: Dave Hansen <dave.hansen@linux.intel.com>
>> CC: Dave Jiang <dave.jiang@intel.com>
>> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: Ira Weiny <ira.weiny@intel.com>
>> CC: Thomas Gleixner <tglx@linutronix.de>
>> CC: Vishal Verma <vishal.l.verma@intel.com>
>> CC: linux-cxl@vger.kernel.org
>> CC: linux-mm@kvack.org
>> CC: nvdimm@lists.linux.dev
>> CC: x86@kernel.org
>>
>> Li Zhijian (7):
>>    mm: memremap: register/unregister altmap region to a separate resource
>>    mm: memremap: add pgmap_parent_resource() helper
>>    nvdimm: pmem: assign a parent resource for vmemmap region for the
>>      fsdax
>>    dax: pmem: assign a parent resource for vmemmap region for the devdax
>>    resource: Introduce walk device_backed_vmemmap res() helper
>>    x86/crash: make device backed vmemmap dumpable for kexec_file_load
>>    nvdimm: set force_raw=1 in kdump kernel
>>
>>   arch/x86/kernel/crash.c         |  5 +++++
>>   drivers/dax/pmem.c              |  8 ++++++--
>>   drivers/nvdimm/namespace_devs.c |  3 +++
>>   drivers/nvdimm/pmem.c           |  9 ++++++---
>>   include/linux/ioport.h          |  4 ++++
>>   include/linux/memremap.h        |  4 ++++
>>   kernel/resource.c               | 13 +++++++++++++
>>   mm/memremap.c                   | 30 +++++++++++++++++++++++++++++-
>>   8 files changed, 70 insertions(+), 6 deletions(-)
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-04-19  2:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-06 10:28 [RFC PATCH v3 0/7] device backed vmemmap crash dump support Li Zhijian
2024-03-06 10:28 ` [PATCH v3 1/7] mm: memremap: register/unregister altmap region to a separate resource Li Zhijian
2024-03-06 10:28 ` [PATCH v3 2/7] mm: memremap: add pgmap_parent_resource() helper Li Zhijian
2024-03-06 10:28 ` [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax Li Zhijian
2024-03-07  8:07   ` kernel test robot
2024-03-07 11:08   ` kernel test robot
2024-03-06 10:28 ` [PATCH v3 4/7] dax: pmem: assign a parent resource for vmemmap region for the devdax Li Zhijian
2024-03-06 10:28 ` [PATCH v3 5/7] resource: Introduce walk device_backed_vmemmap res() helper Li Zhijian
2024-03-06 10:28 ` [PATCH v3 6/7] x86/crash: make device backed vmemmap dumpable for kexec_file_load Li Zhijian
2024-03-06 10:28 ` [PATCH v3 7/7] nvdimm: set force_raw=1 in kdump kernel Li Zhijian
2024-03-21  5:40 ` [RFC PATCH v3 0/7] device backed vmemmap crash dump support Zhijian Li (Fujitsu)
2024-03-21  6:17   ` Baoquan He
2024-03-21  6:57     ` Zhijian Li (Fujitsu)
2024-04-19  2:05   ` Zhijian Li (Fujitsu)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).