linux-acpi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] Manual definition of Soft Reserved memory devices
@ 2020-03-22 16:12 Dan Williams
  2020-03-22 16:12 ` [PATCH v2 1/6] x86/numa: Cleanup configuration dependent command-line options Dan Williams
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: Jason Gunthorpe, Rafael J. Wysocki, Peter Zijlstra,
	Ard Biesheuvel, Jonathan Cameron, Borislav Petkov, Wei Yang, x86,
	H. Peter Anvin, Brice Goglin, Thomas Gleixner, Catalin Marinas,
	Jeff Moyer, Ingo Molnar, Dave Hansen, Will Deacon,
	Rafael J. Wysocki, Ard Biesheuvel, Andy Lutomirski, Tom Lendacky,
	linux-nvdimm, linux-kernel, x86, joao.m.martins

Changes since v1 [1]:
- Kill the ifdef'ery in arch/x86/mm/numa.c (Rafael)

- Add a dummy phys_to_target_node() for ARM64 (0day-robot)

- Initialize ->child and ->sibling to NULL in the resource returned by
  find_next_iomem_res() (Inspired by Tom's feedback even though it does
  not set them like he suggested)

- Collect Ard's Ack

[1]: http://lore.kernel.org/r/158318759687.2216124.4684754859068906007.stgit@dwillia2-desk3.amr.corp.intel.com

---

My primary motivation is making the dax_kmem facility useful to
shipping platforms that have performance differentiated memory, but
may not have EFI-defined soft-reservations / HMAT (or
non-EFI-ACPI-platform equivalent). I'm anticipating HMAT enabled
platforms where the platform firmware policy for what is
soft-reserved, or not, is not the policy the system owner would pick.
I'd also highlight Joao's work [2] (see the TODO section) as an
indication of the demand for custom carving memory resources and
applying the device-dax memory management interface.

Given the current dearth of systems that supply an ACPI HMAT table, and
the utility of being able to manually define device-dax "hmem" instances
via the efi_fake_mem= option, relax the requirements for creating these
devices. Specifically, add an option (numa=nohmat) to optionally disable
consideration of the HMAT and update efi_fake_mem= to behave like
memmap=nn!ss in terms of delimiting device boundaries.

[2]: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/

With Ard's and Rafael's ack I'd feel ok taking this through the nvdimm
tree, please holler if anything still needs some fixups.

Dependencies:

b2ca916ce392 ACPI: NUMA: Up-level "map to online node" functionality
4fcbe96e4d0b mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
575e23b6e13c powerpc/papr_scm: Switch to numa_map_to_online_node()
1e5d8e1e47af x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
5d30f92e7631 x86/NUMA: Provide a range-to-target_node lookup facility
7b27a8622f80 libnvdimm/e820: Retrieve and populate correct 'target_node' info

Tested with:

        numa=nohmat efi_fake_mem=4G@9G:0x40000,4G@13G:0x40000

...to create to device-dax instances:

	# daxctl list -RDu
	[
	  {
	    "path":"\/platform\/hmem.1",
	    "id":1,
	    "size":"4.00 GiB (4.29 GB)",
	    "align":2097152,
	    "devices":[
	      {
	        "chardev":"dax1.0",
	        "size":"4.00 GiB (4.29 GB)",
	        "target_node":3,
	        "mode":"devdax"
	      }
	    ]
	  },
	  {
	    "path":"\/platform\/hmem.0",
	    "id":0,
	    "size":"4.00 GiB (4.29 GB)",
	    "align":2097152,
	    "devices":[
	      {
	        "chardev":"dax0.0",
	        "size":"4.00 GiB (4.29 GB)",
	        "target_node":2,
	        "mode":"devdax"
	      }
	    ]
	  }
	]


---

Dan Williams (6):
      x86/numa: Cleanup configuration dependent command-line options
      x86/numa: Add 'nohmat' option
      efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
      ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
      resource: Report parent to walk_iomem_res_desc() callback
      ACPI: HMAT: Attach a device for each soft-reserved range

 arch/arm64/mm/numa.c                |   13 +++++
 arch/x86/include/asm/numa.h         |    8 +++
 arch/x86/kernel/e820.c              |   16 +++++-
 arch/x86/mm/numa.c                  |   10 +---
 arch/x86/mm/numa_emulation.c        |    3 +
 arch/x86/xen/enlighten_pv.c         |    2 -
 drivers/acpi/numa/hmat.c            |   76 +++++----------------------
 drivers/acpi/numa/srat.c            |    9 +++
 drivers/dax/Kconfig                 |    5 ++
 drivers/dax/Makefile                |    3 -
 drivers/dax/hmem/Makefile           |    6 ++
 drivers/dax/hmem/device.c           |   97 +++++++++++++++++++++++++++++++++++
 drivers/dax/hmem/hmem.c             |    2 -
 drivers/firmware/efi/x86_fake_mem.c |   12 +++-
 include/acpi/acpi_numa.h            |   14 +++++
 include/linux/dax.h                 |    8 +++
 kernel/resource.c                   |   11 +++-
 17 files changed, 209 insertions(+), 86 deletions(-)
 create mode 100644 drivers/dax/hmem/Makefile
 create mode 100644 drivers/dax/hmem/device.c
 rename drivers/dax/{hmem.c => hmem/hmem.c} (98%)

base-commit: 7b27a8622f802761d5c6abd6c37b22312a35343c

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/6] x86/numa: Cleanup configuration dependent command-line options
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
@ 2020-03-22 16:12 ` Dan Williams
  2020-03-22 16:12 ` [PATCH v2 2/6] x86/numa: Add 'nohmat' option Dan Williams
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: Rafael J. Wysocki, peterz, dave.hansen, ard.biesheuvel,
	linux-nvdimm, linux-kernel, x86, joao.m.martins

In preparation for adding a new numa= option clean up the existing ones
to avoid ifdefs in numa_setup(), and provide feedback when the option is
numa=fake= option is invalid due to kernel config. The same does not
need to be done for numa=noacpi, since the capability is already hard
disabled at compile-time.

Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/numa.h  |    8 +++++++-
 arch/x86/mm/numa.c           |    8 ++------
 arch/x86/mm/numa_emulation.c |    3 ++-
 arch/x86/xen/enlighten_pv.c  |    2 +-
 drivers/acpi/numa/srat.c     |    9 +++++++--
 include/acpi/acpi_numa.h     |    6 +++++-
 6 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index bbfde3d2662f..0aecc0b629e0 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_NUMA_H
 
 #include <linux/nodemask.h>
+#include <linux/errno.h>
 
 #include <asm/topology.h>
 #include <asm/apicdef.h>
@@ -77,7 +78,12 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 #ifdef CONFIG_NUMA_EMU
 #define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
-void numa_emu_cmdline(char *);
+int numa_emu_cmdline(char *str);
+#else /* CONFIG_NUMA_EMU */
+static inline int numa_emu_cmdline(char *str)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_NUMA_EMU */
 
 #endif	/* _ASM_X86_NUMA_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 59ba008504dc..9c1266b2628c 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -37,14 +37,10 @@ static __init int numa_setup(char *opt)
 		return -EINVAL;
 	if (!strncmp(opt, "off", 3))
 		numa_off = 1;
-#ifdef CONFIG_NUMA_EMU
 	if (!strncmp(opt, "fake=", 5))
-		numa_emu_cmdline(opt + 5);
-#endif
-#ifdef CONFIG_ACPI_NUMA
+		return numa_emu_cmdline(opt + 5);
 	if (!strncmp(opt, "noacpi", 6))
-		acpi_numa = -1;
-#endif
+		disable_srat();
 	return 0;
 }
 early_param("numa", numa_setup);
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 7f1d2034df1e..0bd27231e038 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -13,9 +13,10 @@
 static int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
 
-void __init numa_emu_cmdline(char *str)
+int __init numa_emu_cmdline(char *str)
 {
 	emu_cmdline = str;
+	return 0;
 }
 
 static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 1f756ffffe8b..7ca8f257b47b 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1283,7 +1283,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
 	 * any NUMA information the kernel tries to get from ACPI will
 	 * be meaningless.  Prevent it from trying.
 	 */
-	acpi_numa = -1;
+	disable_srat();
 #endif
 	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv));
 
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 47b4969d9b93..aa8e5d53f646 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -27,7 +27,12 @@ static int node_to_pxm_map[MAX_NUMNODES]
 			= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
 
 unsigned char acpi_srat_revision __initdata;
-int acpi_numa __initdata;
+static int acpi_numa __initdata;
+
+void __init disable_srat(void)
+{
+	acpi_numa = -1;
+}
 
 int pxm_to_node(int pxm)
 {
@@ -162,7 +167,7 @@ static int __init slit_valid(struct acpi_table_slit *slit)
 void __init bad_srat(void)
 {
 	pr_err("SRAT: SRAT not used.\n");
-	acpi_numa = -1;
+	disable_srat();
 }
 
 int __init srat_disabled(void)
diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index fdebcfc6c8df..8784183b2204 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -17,10 +17,14 @@ extern int pxm_to_node(int);
 extern int node_to_pxm(int);
 extern int acpi_map_pxm_to_node(int);
 extern unsigned char acpi_srat_revision;
-extern int acpi_numa __initdata;
+extern void disable_srat(void);
 
 extern void bad_srat(void);
 extern int srat_disabled(void);
 
+#else				/* CONFIG_ACPI_NUMA */
+static inline void disable_srat(void)
+{
+}
 #endif				/* CONFIG_ACPI_NUMA */
 #endif				/* __ACP_NUMA_H */


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/6] x86/numa: Add 'nohmat' option
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
  2020-03-22 16:12 ` [PATCH v2 1/6] x86/numa: Cleanup configuration dependent command-line options Dan Williams
@ 2020-03-22 16:12 ` Dan Williams
  2020-03-22 16:12 ` [PATCH v2 3/6] efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance Dan Williams
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: x86, Rafael J. Wysocki, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, ard.biesheuvel, linux-nvdimm, linux-kernel, x86,
	joao.m.martins

Disable parsing of the HMAT for debug, to workaround broken platform
instances, or cases where it is otherwise not wanted.

Cc: x86@kernel.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/numa.c       |    2 ++
 drivers/acpi/numa/hmat.c |    8 +++++++-
 include/acpi/acpi_numa.h |    8 ++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 9c1266b2628c..250a013cbbe0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -41,6 +41,8 @@ static __init int numa_setup(char *opt)
 		return numa_emu_cmdline(opt + 5);
 	if (!strncmp(opt, "noacpi", 6))
 		disable_srat();
+	if (!strncmp(opt, "nohmat", 6))
+		disable_hmat();
 	return 0;
 }
 early_param("numa", numa_setup);
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2c32cfb72370..a12e36a12618 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -26,6 +26,12 @@
 #include <linux/sysfs.h>
 
 static u8 hmat_revision;
+static int hmat_disable __initdata;
+
+void __init disable_hmat(void)
+{
+	hmat_disable = 1;
+}
 
 static LIST_HEAD(targets);
 static LIST_HEAD(initiators);
@@ -814,7 +820,7 @@ static __init int hmat_init(void)
 	enum acpi_hmat_type i;
 	acpi_status status;
 
-	if (srat_disabled())
+	if (srat_disabled() || hmat_disable)
 		return 0;
 
 	status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index 8784183b2204..0e9302285f14 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -27,4 +27,12 @@ static inline void disable_srat(void)
 {
 }
 #endif				/* CONFIG_ACPI_NUMA */
+
+#ifdef CONFIG_ACPI_HMAT
+extern void disable_hmat(void);
+#else				/* CONFIG_ACPI_HMAT */
+static inline void disable_hmat(void)
+{
+}
+#endif				/* CONFIG_ACPI_HMAT */
 #endif				/* __ACP_NUMA_H */


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/6] efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
  2020-03-22 16:12 ` [PATCH v2 1/6] x86/numa: Cleanup configuration dependent command-line options Dan Williams
  2020-03-22 16:12 ` [PATCH v2 2/6] x86/numa: Add 'nohmat' option Dan Williams
@ 2020-03-22 16:12 ` Dan Williams
  2020-03-22 16:12 ` [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device Dan Williams
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Ard Biesheuvel, peterz, dave.hansen, ard.biesheuvel,
	linux-nvdimm, linux-kernel, x86, joao.m.martins

In preparation for attaching a platform device per iomem resource teach
the efi_fake_mem code to create an e820 entry per instance. Similar to
E820_TYPE_PRAM, bypass merging resource when the e820 map is sanitized.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/kernel/e820.c              |   16 +++++++++++++++-
 drivers/firmware/efi/x86_fake_mem.c |   12 +++++++++---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c5399e80c59c..96babb3a6629 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -305,6 +305,20 @@ static int __init cpcompare(const void *a, const void *b)
 	return (ap->addr != ap->entry->addr) - (bp->addr != bp->entry->addr);
 }
 
+static bool e820_nomerge(enum e820_type type)
+{
+	/*
+	 * These types may indicate distinct platform ranges aligned to
+	 * numa node, protection domain, performance domain, or other
+	 * boundaries. Do not merge them.
+	 */
+	if (type == E820_TYPE_PRAM)
+		return true;
+	if (type == E820_TYPE_SOFT_RESERVED)
+		return true;
+	return false;
+}
+
 int __init e820__update_table(struct e820_table *table)
 {
 	struct e820_entry *entries = table->entries;
@@ -380,7 +394,7 @@ int __init e820__update_table(struct e820_table *table)
 		}
 
 		/* Continue building up new map based on this information: */
-		if (current_type != last_type || current_type == E820_TYPE_PRAM) {
+		if (current_type != last_type || e820_nomerge(current_type)) {
 			if (last_type != 0)	 {
 				new_entries[new_nr_entries].size = change_point[chg_idx]->addr - last_addr;
 				/* Move forward only if the new size was non-zero: */
diff --git a/drivers/firmware/efi/x86_fake_mem.c b/drivers/firmware/efi/x86_fake_mem.c
index e5d6d5a1b240..0bafcc1bb0f6 100644
--- a/drivers/firmware/efi/x86_fake_mem.c
+++ b/drivers/firmware/efi/x86_fake_mem.c
@@ -38,7 +38,7 @@ void __init efi_fake_memmap_early(void)
 		m_start = mem->range.start;
 		m_end = mem->range.end;
 		for_each_efi_memory_desc(md) {
-			u64 start, end;
+			u64 start, end, size;
 
 			if (md->type != EFI_CONVENTIONAL_MEMORY)
 				continue;
@@ -58,11 +58,17 @@ void __init efi_fake_memmap_early(void)
 			 */
 			start = max(start, m_start);
 			end = min(end, m_end);
+			size = end - start + 1;
 
 			if (end <= start)
 				continue;
-			e820__range_update(start, end - start + 1, E820_TYPE_RAM,
-					E820_TYPE_SOFT_RESERVED);
+
+			/*
+			 * Ensure each efi_fake_mem instance results in
+			 * a unique e820 resource
+			 */
+			e820__range_remove(start, size, E820_TYPE_RAM, 1);
+			e820__range_add(start, size, E820_TYPE_SOFT_RESERVED);
 			e820__update_table(e820_table);
 		}
 	}


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
                   ` (2 preceding siblings ...)
  2020-03-22 16:12 ` [PATCH v2 3/6] efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance Dan Williams
@ 2020-03-22 16:12 ` Dan Williams
  2020-03-24 19:40   ` Joao Martins
  2020-03-22 16:12 ` [PATCH v2 5/6] resource: Report parent to walk_iomem_res_desc() callback Dan Williams
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: Rafael J. Wysocki, peterz, dave.hansen, ard.biesheuvel,
	linux-nvdimm, linux-kernel, x86, joao.m.martins

In preparation for exposing "Soft Reserved" memory ranges without an
HMAT, move the hmem device registration to its own compilation unit and
make the implementation generic.

The generic implementation drops usage acpi_map_pxm_to_online_node()
that was translating ACPI proximity domain values and instead relies on
numa_map_to_online_node() to determine the numa node for the device.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/numa/hmat.c  |   68 ++++-----------------------------------------
 drivers/dax/Kconfig       |    4 +++
 drivers/dax/Makefile      |    3 +-
 drivers/dax/hmem.c        |   56 -------------------------------------
 drivers/dax/hmem/Makefile |    5 +++
 drivers/dax/hmem/device.c |   64 ++++++++++++++++++++++++++++++++++++++++++
 drivers/dax/hmem/hmem.c   |   56 +++++++++++++++++++++++++++++++++++++
 include/linux/dax.h       |    8 +++++
 8 files changed, 144 insertions(+), 120 deletions(-)
 delete mode 100644 drivers/dax/hmem.c
 create mode 100644 drivers/dax/hmem/Makefile
 create mode 100644 drivers/dax/hmem/device.c
 create mode 100644 drivers/dax/hmem/hmem.c

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index a12e36a12618..134bcb40b2af 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -24,6 +24,7 @@
 #include <linux/mutex.h>
 #include <linux/node.h>
 #include <linux/sysfs.h>
+#include <linux/dax.h>
 
 static u8 hmat_revision;
 static int hmat_disable __initdata;
@@ -640,66 +641,6 @@ static void hmat_register_target_perf(struct memory_target *target)
 	node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
 }
 
-static void hmat_register_target_device(struct memory_target *target,
-		struct resource *r)
-{
-	/* define a clean / non-busy resource for the platform device */
-	struct resource res = {
-		.start = r->start,
-		.end = r->end,
-		.flags = IORESOURCE_MEM,
-	};
-	struct platform_device *pdev;
-	struct memregion_info info;
-	int rc, id;
-
-	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
-			IORES_DESC_SOFT_RESERVED);
-	if (rc != REGION_INTERSECTS)
-		return;
-
-	id = memregion_alloc(GFP_KERNEL);
-	if (id < 0) {
-		pr_err("memregion allocation failure for %pr\n", &res);
-		return;
-	}
-
-	pdev = platform_device_alloc("hmem", id);
-	if (!pdev) {
-		pr_err("hmem device allocation failure for %pr\n", &res);
-		goto out_pdev;
-	}
-
-	pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
-	info = (struct memregion_info) {
-		.target_node = acpi_map_pxm_to_node(target->memory_pxm),
-	};
-	rc = platform_device_add_data(pdev, &info, sizeof(info));
-	if (rc < 0) {
-		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
-		goto out_pdev;
-	}
-
-	rc = platform_device_add_resources(pdev, &res, 1);
-	if (rc < 0) {
-		pr_err("hmem resource allocation failure for %pr\n", &res);
-		goto out_resource;
-	}
-
-	rc = platform_device_add(pdev);
-	if (rc < 0) {
-		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
-		goto out_resource;
-	}
-
-	return;
-
-out_resource:
-	put_device(&pdev->dev);
-out_pdev:
-	memregion_free(id);
-}
-
 static void hmat_register_target_devices(struct memory_target *target)
 {
 	struct resource *res;
@@ -711,8 +652,11 @@ static void hmat_register_target_devices(struct memory_target *target)
 	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
 		return;
 
-	for (res = target->memregions.child; res; res = res->sibling)
-		hmat_register_target_device(target, res);
+	for (res = target->memregions.child; res; res = res->sibling) {
+		int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
+
+		hmem_register_device(target_nid, res);
+	}
 }
 
 static void hmat_register_target(struct memory_target *target)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3b6c06f07326..a229f45d34aa 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -48,6 +48,10 @@ config DEV_DAX_HMEM
 
 	  Say M if unsure.
 
+config DEV_DAX_HMEM_DEVICES
+	depends on DEV_DAX_HMEM
+	def_bool y
+
 config DEV_DAX_KMEM
 	tristate "KMEM DAX: volatile-use of persistent memory"
 	default DEV_DAX
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 80065b38b3c4..9d4ba672d305 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -2,11 +2,10 @@
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
-obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
 
 dax-y := super.o
 dax-y += bus.o
 device_dax-y := device.o
-dax_hmem-y := hmem.o
 
 obj-y += pmem/
+obj-y += hmem/
diff --git a/drivers/dax/hmem.c b/drivers/dax/hmem.c
deleted file mode 100644
index fe7214daf62e..000000000000
--- a/drivers/dax/hmem.c
+++ /dev/null
@@ -1,56 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/platform_device.h>
-#include <linux/memregion.h>
-#include <linux/module.h>
-#include <linux/pfn_t.h>
-#include "bus.h"
-
-static int dax_hmem_probe(struct platform_device *pdev)
-{
-	struct device *dev = &pdev->dev;
-	struct dev_pagemap pgmap = { };
-	struct dax_region *dax_region;
-	struct memregion_info *mri;
-	struct dev_dax *dev_dax;
-	struct resource *res;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
-		return -ENOMEM;
-
-	mri = dev->platform_data;
-	memcpy(&pgmap.res, res, sizeof(*res));
-
-	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
-			PMD_SIZE, PFN_DEV|PFN_MAP);
-	if (!dax_region)
-		return -ENOMEM;
-
-	dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
-	if (IS_ERR(dev_dax))
-		return PTR_ERR(dev_dax);
-
-	/* child dev_dax instances now own the lifetime of the dax_region */
-	dax_region_put(dax_region);
-	return 0;
-}
-
-static int dax_hmem_remove(struct platform_device *pdev)
-{
-	/* devm handles teardown */
-	return 0;
-}
-
-static struct platform_driver dax_hmem_driver = {
-	.probe = dax_hmem_probe,
-	.remove = dax_hmem_remove,
-	.driver = {
-		.name = "hmem",
-	},
-};
-
-module_platform_driver(dax_hmem_driver);
-
-MODULE_ALIAS("platform:hmem*");
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
new file mode 100644
index 000000000000..a9d353d0c9ed
--- /dev/null
+++ b/drivers/dax/hmem/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+
+dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
new file mode 100644
index 000000000000..99bc15a8b031
--- /dev/null
+++ b/drivers/dax/hmem/device.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+
+void hmem_register_device(int target_nid, struct resource *r)
+{
+	/* define a clean / non-busy resource for the platform device */
+	struct resource res = {
+		.start = r->start,
+		.end = r->end,
+		.flags = IORESOURCE_MEM,
+	};
+	struct platform_device *pdev;
+	struct memregion_info info;
+	int rc, id;
+
+	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
+			IORES_DESC_SOFT_RESERVED);
+	if (rc != REGION_INTERSECTS)
+		return;
+
+	id = memregion_alloc(GFP_KERNEL);
+	if (id < 0) {
+		pr_err("memregion allocation failure for %pr\n", &res);
+		return;
+	}
+
+	pdev = platform_device_alloc("hmem", id);
+	if (!pdev) {
+		pr_err("hmem device allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
+	info = (struct memregion_info) {
+		.target_node = target_nid,
+	};
+	rc = platform_device_add_data(pdev, &info, sizeof(info));
+	if (rc < 0) {
+		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	rc = platform_device_add_resources(pdev, &res, 1);
+	if (rc < 0) {
+		pr_err("hmem resource allocation failure for %pr\n", &res);
+		goto out_resource;
+	}
+
+	rc = platform_device_add(pdev);
+	if (rc < 0) {
+		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
+		goto out_resource;
+	}
+
+	return;
+
+out_resource:
+	put_device(&pdev->dev);
+out_pdev:
+	memregion_free(id);
+}
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
new file mode 100644
index 000000000000..29ceb5795297
--- /dev/null
+++ b/drivers/dax/hmem/hmem.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/pfn_t.h>
+#include "../bus.h"
+
+static int dax_hmem_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct dev_pagemap pgmap = { };
+	struct dax_region *dax_region;
+	struct memregion_info *mri;
+	struct dev_dax *dev_dax;
+	struct resource *res;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENOMEM;
+
+	mri = dev->platform_data;
+	memcpy(&pgmap.res, res, sizeof(*res));
+
+	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
+			PMD_SIZE, PFN_DEV|PFN_MAP);
+	if (!dax_region)
+		return -ENOMEM;
+
+	dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
+	if (IS_ERR(dev_dax))
+		return PTR_ERR(dev_dax);
+
+	/* child dev_dax instances now own the lifetime of the dax_region */
+	dax_region_put(dax_region);
+	return 0;
+}
+
+static int dax_hmem_remove(struct platform_device *pdev)
+{
+	/* devm handles teardown */
+	return 0;
+}
+
+static struct platform_driver dax_hmem_driver = {
+	.probe = dax_hmem_probe,
+	.remove = dax_hmem_remove,
+	.driver = {
+		.name = "hmem",
+	},
+};
+
+module_platform_driver(dax_hmem_driver);
+
+MODULE_ALIAS("platform:hmem*");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9bd8528bd305..9f6c282e9140 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,4 +239,12 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
+void hmem_register_device(int target_nid, struct resource *r);
+#else
+static inline void hmem_register_device(int target_nid, struct resource *r)
+{
+}
+#endif
+
 #endif


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 5/6] resource: Report parent to walk_iomem_res_desc() callback
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
                   ` (3 preceding siblings ...)
  2020-03-22 16:12 ` [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device Dan Williams
@ 2020-03-22 16:12 ` Dan Williams
  2020-03-22 16:12 ` [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range Dan Williams
  2020-03-25 10:02 ` [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Rafael J. Wysocki
  6 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: Jason Gunthorpe, Dave Hansen, Wei Yang, Tom Lendacky, peterz,
	ard.biesheuvel, linux-nvdimm, linux-kernel, x86, joao.m.martins

In support of detecting whether a resource might have been been claimed,
report the parent to the walk_iomem_res_desc() callback. For example,
the ACPI HMAT parser publishes "hmem" platform devices per target range.
However, if the HMAT is disabled / missing a fallback driver can attach
devices to the raw memory ranges as a fallback if it sees unclaimed /
orphan "Soft Reserved" resources in the resource tree.

Otherwise, find_next_iomem_res() returns a resource with garbage data
from the stack allocation in __walk_iomem_res_desc() for the res->parent
field.

There are currently no users that expect ->child and ->sibling to be
valid, and the resource_lock would be needed to traverse them. Use a
compound literal to implicitly zero initialize the fields that are not
being returned in addition to setting ->parent.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/resource.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 76036a41143b..f54ccf7a1009 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -382,10 +382,13 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 
 	if (p) {
 		/* copy data */
-		res->start = max(start, p->start);
-		res->end = min(end, p->end);
-		res->flags = p->flags;
-		res->desc = p->desc;
+		*res = (struct resource) {
+			.start = max(start, p->start),
+			.end = min(end, p->end),
+			.flags = p->flags,
+			.desc = p->desc,
+			.parent = p->parent,
+		};
 	}
 
 	read_unlock(&resource_lock);


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
                   ` (4 preceding siblings ...)
  2020-03-22 16:12 ` [PATCH v2 5/6] resource: Report parent to walk_iomem_res_desc() callback Dan Williams
@ 2020-03-22 16:12 ` Dan Williams
  2020-03-24 19:41   ` Joao Martins
  2020-03-25 11:10   ` Will Deacon
  2020-03-25 10:02 ` [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Rafael J. Wysocki
  6 siblings, 2 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-22 16:12 UTC (permalink / raw)
  To: linux-acpi
  Cc: Jonathan Cameron, Brice Goglin, Ard Biesheuvel,
	Rafael J. Wysocki, Jeff Moyer, Catalin Marinas, Will Deacon,
	peterz, dave.hansen, linux-nvdimm, linux-kernel, x86,
	joao.m.martins

The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
"soft reserved" memory as an "hmem" device")' only registered ranges to
the hmem driver for each soft-reservation that also appeared in the
HMAT. While this is meant to encourage platform firmware to "do the
right thing" and publish an HMAT, the corollary is that platforms that
fail to publish an accurate HMAT will strand memory from Linux usage.
Additionally, the "efi_fake_mem" kernel command line option enabling
will strand memory by default without an HMAT.

Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
to be published as raw resource ranges for the hmem driver to consume.

Include a module parameter to disable either this fallback behavior, or
the hmat enabling from creating hmem devices. The module parameter
requires the hmem device enabling to have unique name in the module
namespace: "device_hmem".

Rather than mark this x86-only, include an interim phys_to_target_node()
implementation for arm64.

Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm64/mm/numa.c      |   13 +++++++++++++
 drivers/dax/Kconfig       |    1 +
 drivers/dax/hmem/Makefile |    3 ++-
 drivers/dax/hmem/device.c |   33 +++++++++++++++++++++++++++++++++
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4decf1659700..00fba21eaec0 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -468,3 +468,16 @@ int memory_add_physaddr_to_nid(u64 addr)
 	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
 	return 0;
 }
+
+/*
+ * device-dax instance registrations want a valid target-node in case
+ * they are ever onlined as memory (see hmem_register_device()).
+ *
+ * TODO: consult cached numa info
+ */
+int phys_to_target_node(phys_addr_t addr)
+{
+	pr_warn_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+			addr);
+	return 0;
+}
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a229f45d34aa..163edde6ba41 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -50,6 +50,7 @@ config DEV_DAX_HMEM
 
 config DEV_DAX_HMEM_DEVICES
 	depends on DEV_DAX_HMEM
+	select NUMA_KEEP_MEMINFO if NUMA
 	def_bool y
 
 config DEV_DAX_KMEM
diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
index a9d353d0c9ed..57377b4c3d47 100644
--- a/drivers/dax/hmem/Makefile
+++ b/drivers/dax/hmem/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
-obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
 
+device_hmem-y := device.o
 dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index 99bc15a8b031..f9c5fa8b1880 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -4,6 +4,9 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 
+static bool nohmem;
+module_param_named(disable, nohmem, bool, 0444);
+
 void hmem_register_device(int target_nid, struct resource *r)
 {
 	/* define a clean / non-busy resource for the platform device */
@@ -16,6 +19,9 @@ void hmem_register_device(int target_nid, struct resource *r)
 	struct memregion_info info;
 	int rc, id;
 
+	if (nohmem)
+		return;
+
 	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
 			IORES_DESC_SOFT_RESERVED);
 	if (rc != REGION_INTERSECTS)
@@ -62,3 +68,30 @@ void hmem_register_device(int target_nid, struct resource *r)
 out_pdev:
 	memregion_free(id);
 }
+
+static __init int hmem_register_one(struct resource *res, void *data)
+{
+	/*
+	 * If the resource is not a top-level resource it was already
+	 * assigned to a device by the HMAT parsing.
+	 */
+	if (res->parent != &iomem_resource)
+		return 0;
+
+	hmem_register_device(phys_to_target_node(res->start), res);
+
+	return 0;
+}
+
+static __init int hmem_init(void)
+{
+	walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
+			IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
+	return 0;
+}
+
+/*
+ * As this is a fallback for address ranges unclaimed by the ACPI HMAT
+ * parsing it must be at an initcall level greater than hmat_init().
+ */
+late_initcall(hmem_init);


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
  2020-03-22 16:12 ` [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device Dan Williams
@ 2020-03-24 19:40   ` Joao Martins
  2020-03-24 21:04     ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Joao Martins @ 2020-03-24 19:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-acpi, Rafael J. Wysocki, peterz, dave.hansen,
	ard.biesheuvel, linux-nvdimm, linux-kernel, x86


On 3/22/20 4:12 PM, Dan Williams wrote:
> In preparation for exposing "Soft Reserved" memory ranges without an
> HMAT, move the hmem device registration to its own compilation unit and
> make the implementation generic.
> 
> The generic implementation drops usage acpi_map_pxm_to_online_node()
> that was translating ACPI proximity domain values and instead relies on
> numa_map_to_online_node() to determine the numa node for the device.
> 
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/acpi/numa/hmat.c  |   68 ++++-----------------------------------------
>  drivers/dax/Kconfig       |    4 +++
>  drivers/dax/Makefile      |    3 +-
>  drivers/dax/hmem.c        |   56 -------------------------------------
>  drivers/dax/hmem/Makefile |    5 +++
>  drivers/dax/hmem/device.c |   64 ++++++++++++++++++++++++++++++++++++++++++
>  drivers/dax/hmem/hmem.c   |   56 +++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h       |    8 +++++
>  8 files changed, 144 insertions(+), 120 deletions(-)
>  delete mode 100644 drivers/dax/hmem.c
>  create mode 100644 drivers/dax/hmem/Makefile
>  create mode 100644 drivers/dax/hmem/device.c
>  create mode 100644 drivers/dax/hmem/hmem.c
> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index a12e36a12618..134bcb40b2af 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -24,6 +24,7 @@
>  #include <linux/mutex.h>
>  #include <linux/node.h>
>  #include <linux/sysfs.h>
> +#include <linux/dax.h>
>  
>  static u8 hmat_revision;
>  static int hmat_disable __initdata;
> @@ -640,66 +641,6 @@ static void hmat_register_target_perf(struct memory_target *target)
>  	node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
>  }
>  
> -static void hmat_register_target_device(struct memory_target *target,
          ^^^^ int ?

> -		struct resource *r)
> -{
> -	/* define a clean / non-busy resource for the platform device */
> -	struct resource res = {
> -		.start = r->start,
> -		.end = r->end,
> -		.flags = IORESOURCE_MEM,
> -	};
> -	struct platform_device *pdev;
> -	struct memregion_info info;
> -	int rc, id;
> -
> -	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> -			IORES_DESC_SOFT_RESERVED);
> -	if (rc != REGION_INTERSECTS)
> -		return;
		^ return -ENXIO;

> -
> -	id = memregion_alloc(GFP_KERNEL);
> -	if (id < 0) {
> -		pr_err("memregion allocation failure for %pr\n", &res);
> -		return;
		^ return -ENOMEM;

> -	}
> -
> -	pdev = platform_device_alloc("hmem", id);
> -	if (!pdev) {

		rc = -ENOMEM;

> -		pr_err("hmem device allocation failure for %pr\n", &res);
> -		goto out_pdev;
> -	}
> -
> -	pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
> -	info = (struct memregion_info) {
> -		.target_node = acpi_map_pxm_to_node(target->memory_pxm),
> -	};
> -	rc = platform_device_add_data(pdev, &info, sizeof(info));
> -	if (rc < 0) {
> -		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
> -		goto out_pdev;
> -	}
> -
> -	rc = platform_device_add_resources(pdev, &res, 1);
> -	if (rc < 0) {
> -		pr_err("hmem resource allocation failure for %pr\n", &res);
> -		goto out_resource;
> -	}
> -
> -	rc = platform_device_add(pdev);
> -	if (rc < 0) {
> -		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
> -		goto out_resource;
> -	}
> -
> -	return;
	^^^^^^ return 0;
> -
> -out_resource:
> -	put_device(&pdev->dev);
> -out_pdev:
> -	memregion_free(id);

	return rc;

> -}
> -
>  static void hmat_register_target_devices(struct memory_target *target)
>  {
>  	struct resource *res;
> @@ -711,8 +652,11 @@ static void hmat_register_target_devices(struct memory_target *target)
>  	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
>  		return;
>  
> -	for (res = target->memregions.child; res; res = res->sibling)
> -		hmat_register_target_device(target, res);
> +	for (res = target->memregions.child; res; res = res->sibling) {
> +		int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
> +
> +		hmem_register_device(target_nid, res);
> +	}
>  }
>  

If it makes sense to propagate error to hmem_register_device (as noted above),
then here perhaps a pr_err() if hmem registration fails mainly for reporting
purposes?

Regardless of the error nits, looks good overall. So, for what is worth:

  Reviewed-by: Joao Martins <joao.m.martins@oracle.com>

	Joao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range
  2020-03-22 16:12 ` [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range Dan Williams
@ 2020-03-24 19:41   ` Joao Martins
  2020-03-24 21:06     ` Dan Williams
  2020-03-25 11:10   ` Will Deacon
  1 sibling, 1 reply; 16+ messages in thread
From: Joao Martins @ 2020-03-24 19:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-acpi, Jonathan Cameron, Brice Goglin, Ard Biesheuvel,
	Rafael J. Wysocki, Catalin Marinas, Will Deacon, peterz,
	dave.hansen, linux-nvdimm, linux-kernel, x86

On 3/22/20 4:12 PM, Dan Williams wrote:
> The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
> "soft reserved" memory as an "hmem" device")' only registered ranges to
> the hmem driver for each soft-reservation that also appeared in the
> HMAT. While this is meant to encourage platform firmware to "do the
> right thing" and publish an HMAT, the corollary is that platforms that
> fail to publish an accurate HMAT will strand memory from Linux usage.
> Additionally, the "efi_fake_mem" kernel command line option enabling
> will strand memory by default without an HMAT.
> 
> Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
> to be published as raw resource ranges for the hmem driver to consume.
> 
> Include a module parameter to disable either this fallback behavior, or
> the hmat enabling from creating hmem devices. The module parameter
> requires the hmem device enabling to have unique name in the module
> namespace: "device_hmem".
> 
> Rather than mark this x86-only, include an interim phys_to_target_node()
> implementation for arm64.
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Brice Goglin <Brice.Goglin@inria.fr>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/arm64/mm/numa.c      |   13 +++++++++++++
>  drivers/dax/Kconfig       |    1 +
>  drivers/dax/hmem/Makefile |    3 ++-
>  drivers/dax/hmem/device.c |   33 +++++++++++++++++++++++++++++++++
>  4 files changed, 49 insertions(+), 1 deletion(-)
> 

[...]

> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 99bc15a8b031..f9c5fa8b1880 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -4,6 +4,9 @@
>  #include <linux/module.h>
>  #include <linux/mm.h>
>  
> +static bool nohmem;
> +module_param_named(disable, nohmem, bool, 0444);
> +
>  void hmem_register_device(int target_nid, struct resource *r)
>  {
>  	/* define a clean / non-busy resource for the platform device */
> @@ -16,6 +19,9 @@ void hmem_register_device(int target_nid, struct resource *r)
>  	struct memregion_info info;
>  	int rc, id;
>  
> +	if (nohmem)
> +		return;
> +
>  	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
>  			IORES_DESC_SOFT_RESERVED);
>  	if (rc != REGION_INTERSECTS)
> @@ -62,3 +68,30 @@ void hmem_register_device(int target_nid, struct resource *r)
>  out_pdev:
>  	memregion_free(id);
>  }
> +
> +static __init int hmem_register_one(struct resource *res, void *data)
> +{
> +	/*
> +	 * If the resource is not a top-level resource it was already
> +	 * assigned to a device by the HMAT parsing.
> +	 */
> +	if (res->parent != &iomem_resource)
> +		return 0;
> +
> +	hmem_register_device(phys_to_target_node(res->start), res);
> +
> +	return 0;

Should we add an error returning value to hmem_register_device() perhaps this
ought to be reflected in hmem_register_one().

> +}
> +
> +static __init int hmem_init(void)
> +{
> +	walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
> +			IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
> +	return 0;
> +}
> +

(...) and then perhaps here returning in the initcall if any of the resources
failed hmem registration?

> +/*
> + * As this is a fallback for address ranges unclaimed by the ACPI HMAT
> + * parsing it must be at an initcall level greater than hmat_init().
> + */
> +late_initcall(hmem_init);

Regardless of the nit (which ties in to patch 4), looks good. So, FWIW:

  Reviewed-by: Joao Martins <joao.m.martins@oracle.com>

For the hmem changes.

	Joao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
  2020-03-24 19:40   ` Joao Martins
@ 2020-03-24 21:04     ` Dan Williams
  2020-03-25 22:32       ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2020-03-24 21:04 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux ACPI, Rafael J. Wysocki, Peter Zijlstra, Dave Hansen,
	Ard Biesheuvel, linux-nvdimm, Linux Kernel Mailing List, X86 ML

On Tue, Mar 24, 2020 at 12:41 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
> On 3/22/20 4:12 PM, Dan Williams wrote:
> > In preparation for exposing "Soft Reserved" memory ranges without an
> > HMAT, move the hmem device registration to its own compilation unit and
> > make the implementation generic.
> >
> > The generic implementation drops usage acpi_map_pxm_to_online_node()
> > that was translating ACPI proximity domain values and instead relies on
> > numa_map_to_online_node() to determine the numa node for the device.
> >
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/acpi/numa/hmat.c  |   68 ++++-----------------------------------------
> >  drivers/dax/Kconfig       |    4 +++
> >  drivers/dax/Makefile      |    3 +-
> >  drivers/dax/hmem.c        |   56 -------------------------------------
> >  drivers/dax/hmem/Makefile |    5 +++
> >  drivers/dax/hmem/device.c |   64 ++++++++++++++++++++++++++++++++++++++++++
> >  drivers/dax/hmem/hmem.c   |   56 +++++++++++++++++++++++++++++++++++++
> >  include/linux/dax.h       |    8 +++++
> >  8 files changed, 144 insertions(+), 120 deletions(-)
> >  delete mode 100644 drivers/dax/hmem.c
> >  create mode 100644 drivers/dax/hmem/Makefile
> >  create mode 100644 drivers/dax/hmem/device.c
> >  create mode 100644 drivers/dax/hmem/hmem.c
> >
> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > index a12e36a12618..134bcb40b2af 100644
> > --- a/drivers/acpi/numa/hmat.c
> > +++ b/drivers/acpi/numa/hmat.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/mutex.h>
> >  #include <linux/node.h>
> >  #include <linux/sysfs.h>
> > +#include <linux/dax.h>
> >
> >  static u8 hmat_revision;
> >  static int hmat_disable __initdata;
> > @@ -640,66 +641,6 @@ static void hmat_register_target_perf(struct memory_target *target)
> >       node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
> >  }
> >
> > -static void hmat_register_target_device(struct memory_target *target,
>           ^^^^ int ?
>
> > -             struct resource *r)
> > -{
> > -     /* define a clean / non-busy resource for the platform device */
> > -     struct resource res = {
> > -             .start = r->start,
> > -             .end = r->end,
> > -             .flags = IORESOURCE_MEM,
> > -     };
> > -     struct platform_device *pdev;
> > -     struct memregion_info info;
> > -     int rc, id;
> > -
> > -     rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> > -                     IORES_DESC_SOFT_RESERVED);
> > -     if (rc != REGION_INTERSECTS)
> > -             return;
>                 ^ return -ENXIO;
>
> > -
> > -     id = memregion_alloc(GFP_KERNEL);
> > -     if (id < 0) {
> > -             pr_err("memregion allocation failure for %pr\n", &res);
> > -             return;
>                 ^ return -ENOMEM;
>
> > -     }
> > -
> > -     pdev = platform_device_alloc("hmem", id);
> > -     if (!pdev) {
>
>                 rc = -ENOMEM;
>
> > -             pr_err("hmem device allocation failure for %pr\n", &res);
> > -             goto out_pdev;
> > -     }
> > -
> > -     pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
> > -     info = (struct memregion_info) {
> > -             .target_node = acpi_map_pxm_to_node(target->memory_pxm),
> > -     };
> > -     rc = platform_device_add_data(pdev, &info, sizeof(info));
> > -     if (rc < 0) {
> > -             pr_err("hmem memregion_info allocation failure for %pr\n", &res);
> > -             goto out_pdev;
> > -     }
> > -
> > -     rc = platform_device_add_resources(pdev, &res, 1);
> > -     if (rc < 0) {
> > -             pr_err("hmem resource allocation failure for %pr\n", &res);
> > -             goto out_resource;
> > -     }
> > -
> > -     rc = platform_device_add(pdev);
> > -     if (rc < 0) {
> > -             dev_err(&pdev->dev, "device add failed for %pr\n", &res);
> > -             goto out_resource;
> > -     }
> > -
> > -     return;
>         ^^^^^^ return 0;
> > -
> > -out_resource:
> > -     put_device(&pdev->dev);
> > -out_pdev:
> > -     memregion_free(id);
>
>         return rc;
>
> > -}
> > -
> >  static void hmat_register_target_devices(struct memory_target *target)
> >  {
> >       struct resource *res;
> > @@ -711,8 +652,11 @@ static void hmat_register_target_devices(struct memory_target *target)
> >       if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
> >               return;
> >
> > -     for (res = target->memregions.child; res; res = res->sibling)
> > -             hmat_register_target_device(target, res);
> > +     for (res = target->memregions.child; res; res = res->sibling) {
> > +             int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
> > +
> > +             hmem_register_device(target_nid, res);
> > +     }
> >  }
> >
>
> If it makes sense to propagate error to hmem_register_device (as noted above),
> then here perhaps a pr_err() if hmem registration fails mainly for reporting
> purposes?

Yeah, I guess it makes sense to at least log that something went wrong
if someone wonders where their memory device went. I'll add that
rework as a follow-on.

> Regardless of the error nits, looks good overall. So, for what is worth:
>
>   Reviewed-by: Joao Martins <joao.m.martins@oracle.com>

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range
  2020-03-24 19:41   ` Joao Martins
@ 2020-03-24 21:06     ` Dan Williams
  2020-03-24 21:30       ` Joao Martins
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2020-03-24 21:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux ACPI, Jonathan Cameron, Brice Goglin, Ard Biesheuvel,
	Rafael J. Wysocki, Catalin Marinas, Will Deacon, Peter Zijlstra,
	Dave Hansen, linux-nvdimm, Linux Kernel Mailing List, X86 ML

On Tue, Mar 24, 2020 at 12:41 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 3/22/20 4:12 PM, Dan Williams wrote:
> > The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
> > "soft reserved" memory as an "hmem" device")' only registered ranges to
> > the hmem driver for each soft-reservation that also appeared in the
> > HMAT. While this is meant to encourage platform firmware to "do the
> > right thing" and publish an HMAT, the corollary is that platforms that
> > fail to publish an accurate HMAT will strand memory from Linux usage.
> > Additionally, the "efi_fake_mem" kernel command line option enabling
> > will strand memory by default without an HMAT.
> >
> > Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
> > to be published as raw resource ranges for the hmem driver to consume.
> >
> > Include a module parameter to disable either this fallback behavior, or
> > the hmat enabling from creating hmem devices. The module parameter
> > requires the hmem device enabling to have unique name in the module
> > namespace: "device_hmem".
> >
> > Rather than mark this x86-only, include an interim phys_to_target_node()
> > implementation for arm64.
> >
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Brice Goglin <Brice.Goglin@inria.fr>
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  arch/arm64/mm/numa.c      |   13 +++++++++++++
> >  drivers/dax/Kconfig       |    1 +
> >  drivers/dax/hmem/Makefile |    3 ++-
> >  drivers/dax/hmem/device.c |   33 +++++++++++++++++++++++++++++++++
> >  4 files changed, 49 insertions(+), 1 deletion(-)
> >
>
> [...]
>
> > diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> > index 99bc15a8b031..f9c5fa8b1880 100644
> > --- a/drivers/dax/hmem/device.c
> > +++ b/drivers/dax/hmem/device.c
> > @@ -4,6 +4,9 @@
> >  #include <linux/module.h>
> >  #include <linux/mm.h>
> >
> > +static bool nohmem;
> > +module_param_named(disable, nohmem, bool, 0444);
> > +
> >  void hmem_register_device(int target_nid, struct resource *r)
> >  {
> >       /* define a clean / non-busy resource for the platform device */
> > @@ -16,6 +19,9 @@ void hmem_register_device(int target_nid, struct resource *r)
> >       struct memregion_info info;
> >       int rc, id;
> >
> > +     if (nohmem)
> > +             return;
> > +
> >       rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> >                       IORES_DESC_SOFT_RESERVED);
> >       if (rc != REGION_INTERSECTS)
> > @@ -62,3 +68,30 @@ void hmem_register_device(int target_nid, struct resource *r)
> >  out_pdev:
> >       memregion_free(id);
> >  }
> > +
> > +static __init int hmem_register_one(struct resource *res, void *data)
> > +{
> > +     /*
> > +      * If the resource is not a top-level resource it was already
> > +      * assigned to a device by the HMAT parsing.
> > +      */
> > +     if (res->parent != &iomem_resource)
> > +             return 0;
> > +
> > +     hmem_register_device(phys_to_target_node(res->start), res);
> > +
> > +     return 0;
>
> Should we add an error returning value to hmem_register_device() perhaps this
> ought to be reflected in hmem_register_one().
>
> > +}
> > +
> > +static __init int hmem_init(void)
> > +{
> > +     walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
> > +                     IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
> > +     return 0;
> > +}
> > +
>
> (...) and then perhaps here returning in the initcall if any of the resources
> failed hmem registration?

Except that hmem_register_one() is a stop-gap to collect soft-reserved
ranges that were not already registered, and it's not an error to find
already registered devices. However, I do think it's a good idea to
log registrations that failed for other reasons.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range
  2020-03-24 21:06     ` Dan Williams
@ 2020-03-24 21:30       ` Joao Martins
  0 siblings, 0 replies; 16+ messages in thread
From: Joao Martins @ 2020-03-24 21:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux ACPI, Jonathan Cameron, Brice Goglin, Ard Biesheuvel,
	Rafael J. Wysocki, Catalin Marinas, Will Deacon, Peter Zijlstra,
	Dave Hansen, linux-nvdimm, Linux Kernel Mailing List, X86 ML

On 3/24/20 9:06 PM, Dan Williams wrote:
> On Tue, Mar 24, 2020 at 12:41 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 3/22/20 4:12 PM, Dan Williams wrote:
>>> The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
>>> "soft reserved" memory as an "hmem" device")' only registered ranges to
>>> the hmem driver for each soft-reservation that also appeared in the
>>> HMAT. While this is meant to encourage platform firmware to "do the
>>> right thing" and publish an HMAT, the corollary is that platforms that
>>> fail to publish an accurate HMAT will strand memory from Linux usage.
>>> Additionally, the "efi_fake_mem" kernel command line option enabling
>>> will strand memory by default without an HMAT.
>>>
>>> Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
>>> to be published as raw resource ranges for the hmem driver to consume.
>>>
>>> Include a module parameter to disable either this fallback behavior, or
>>> the hmat enabling from creating hmem devices. The module parameter
>>> requires the hmem device enabling to have unique name in the module
>>> namespace: "device_hmem".
>>>
>>> Rather than mark this x86-only, include an interim phys_to_target_node()
>>> implementation for arm64.
>>>
>>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Cc: Brice Goglin <Brice.Goglin@inria.fr>
>>> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
>>> Cc: Jeff Moyer <jmoyer@redhat.com>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Will Deacon <will@kernel.org>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  arch/arm64/mm/numa.c      |   13 +++++++++++++
>>>  drivers/dax/Kconfig       |    1 +
>>>  drivers/dax/hmem/Makefile |    3 ++-
>>>  drivers/dax/hmem/device.c |   33 +++++++++++++++++++++++++++++++++
>>>  4 files changed, 49 insertions(+), 1 deletion(-)
>>>
>>
>> [...]
>>
>>> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
>>> index 99bc15a8b031..f9c5fa8b1880 100644
>>> --- a/drivers/dax/hmem/device.c
>>> +++ b/drivers/dax/hmem/device.c
>>> @@ -4,6 +4,9 @@
>>>  #include <linux/module.h>
>>>  #include <linux/mm.h>
>>>
>>> +static bool nohmem;
>>> +module_param_named(disable, nohmem, bool, 0444);
>>> +
>>>  void hmem_register_device(int target_nid, struct resource *r)
>>>  {
>>>       /* define a clean / non-busy resource for the platform device */
>>> @@ -16,6 +19,9 @@ void hmem_register_device(int target_nid, struct resource *r)
>>>       struct memregion_info info;
>>>       int rc, id;
>>>
>>> +     if (nohmem)
>>> +             return;
>>> +
>>>       rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
>>>                       IORES_DESC_SOFT_RESERVED);
>>>       if (rc != REGION_INTERSECTS)
>>> @@ -62,3 +68,30 @@ void hmem_register_device(int target_nid, struct resource *r)
>>>  out_pdev:
>>>       memregion_free(id);
>>>  }
>>> +
>>> +static __init int hmem_register_one(struct resource *res, void *data)
>>> +{
>>> +     /*
>>> +      * If the resource is not a top-level resource it was already
>>> +      * assigned to a device by the HMAT parsing.
>>> +      */
>>> +     if (res->parent != &iomem_resource)
>>> +             return 0;
>>> +
>>> +     hmem_register_device(phys_to_target_node(res->start), res);
>>> +
>>> +     return 0;
>>
>> Should we add an error returning value to hmem_register_device() perhaps this
>> ought to be reflected in hmem_register_one().
>>
>>> +}
>>> +
>>> +static __init int hmem_init(void)
>>> +{
>>> +     walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
>>> +                     IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
>>> +     return 0;
>>> +}
>>> +
>>
>> (...) and then perhaps here returning in the initcall if any of the resources
>> failed hmem registration?
> 
> Except that hmem_register_one() is a stop-gap to collect soft-reserved
> ranges that were not already registered, and it's not an error to find
> already registered devices. 
> 
/nods

And if we were to return an error (say for hmem0 out of 4 hmem ones)  before
walking through all soft-reserved found resources, if would skip registration
for the remaining ones.

  Joao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/6] Manual definition of Soft Reserved memory devices
  2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
                   ` (5 preceding siblings ...)
  2020-03-22 16:12 ` [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range Dan Williams
@ 2020-03-25 10:02 ` Rafael J. Wysocki
  6 siblings, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2020-03-25 10:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: ACPI Devel Maling List, Jason Gunthorpe, Rafael J. Wysocki,
	Peter Zijlstra, Ard Biesheuvel, Jonathan Cameron,
	Borislav Petkov, Wei Yang, the arch/x86 maintainers,
	H. Peter Anvin, Brice Goglin, Thomas Gleixner, Catalin Marinas,
	Jeff Moyer, Ingo Molnar, Dave Hansen, Will Deacon,
	Rafael J. Wysocki, Ard Biesheuvel, Andy Lutomirski, Tom Lendacky,
	linux-nvdimm, Linux Kernel Mailing List, Joao Martins

On Sun, Mar 22, 2020 at 5:28 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Changes since v1 [1]:
> - Kill the ifdef'ery in arch/x86/mm/numa.c (Rafael)
>
> - Add a dummy phys_to_target_node() for ARM64 (0day-robot)
>
> - Initialize ->child and ->sibling to NULL in the resource returned by
>   find_next_iomem_res() (Inspired by Tom's feedback even though it does
>   not set them like he suggested)
>
> - Collect Ard's Ack
>
> [1]: http://lore.kernel.org/r/158318759687.2216124.4684754859068906007.stgit@dwillia2-desk3.amr.corp.intel.com
>
> ---
>
> My primary motivation is making the dax_kmem facility useful to
> shipping platforms that have performance differentiated memory, but
> may not have EFI-defined soft-reservations / HMAT (or
> non-EFI-ACPI-platform equivalent). I'm anticipating HMAT enabled
> platforms where the platform firmware policy for what is
> soft-reserved, or not, is not the policy the system owner would pick.
> I'd also highlight Joao's work [2] (see the TODO section) as an
> indication of the demand for custom carving memory resources and
> applying the device-dax memory management interface.
>
> Given the current dearth of systems that supply an ACPI HMAT table, and
> the utility of being able to manually define device-dax "hmem" instances
> via the efi_fake_mem= option, relax the requirements for creating these
> devices. Specifically, add an option (numa=nohmat) to optionally disable
> consideration of the HMAT and update efi_fake_mem= to behave like
> memmap=nn!ss in terms of delimiting device boundaries.
>
> [2]: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/
>
> With Ard's and Rafael's ack I'd feel ok taking this through the nvdimm
> tree, please holler if anything still needs some fixups.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

for the whole series.

Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range
  2020-03-22 16:12 ` [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range Dan Williams
  2020-03-24 19:41   ` Joao Martins
@ 2020-03-25 11:10   ` Will Deacon
  2020-03-25 17:10     ` Dan Williams
  1 sibling, 1 reply; 16+ messages in thread
From: Will Deacon @ 2020-03-25 11:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-acpi, Jonathan Cameron, Brice Goglin, Ard Biesheuvel,
	Rafael J. Wysocki, Jeff Moyer, Catalin Marinas, peterz,
	dave.hansen, linux-nvdimm, linux-kernel, x86, joao.m.martins

On Sun, Mar 22, 2020 at 09:12:58AM -0700, Dan Williams wrote:
> The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
> "soft reserved" memory as an "hmem" device")' only registered ranges to
> the hmem driver for each soft-reservation that also appeared in the
> HMAT. While this is meant to encourage platform firmware to "do the
> right thing" and publish an HMAT, the corollary is that platforms that
> fail to publish an accurate HMAT will strand memory from Linux usage.
> Additionally, the "efi_fake_mem" kernel command line option enabling
> will strand memory by default without an HMAT.
> 
> Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
> to be published as raw resource ranges for the hmem driver to consume.
> 
> Include a module parameter to disable either this fallback behavior, or
> the hmat enabling from creating hmem devices. The module parameter
> requires the hmem device enabling to have unique name in the module
> namespace: "device_hmem".
> 
> Rather than mark this x86-only, include an interim phys_to_target_node()
> implementation for arm64.
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Brice Goglin <Brice.Goglin@inria.fr>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/arm64/mm/numa.c      |   13 +++++++++++++
>  drivers/dax/Kconfig       |    1 +
>  drivers/dax/hmem/Makefile |    3 ++-
>  drivers/dax/hmem/device.c |   33 +++++++++++++++++++++++++++++++++
>  4 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index 4decf1659700..00fba21eaec0 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -468,3 +468,16 @@ int memory_add_physaddr_to_nid(u64 addr)
>  	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>  	return 0;
>  }
> +
> +/*
> + * device-dax instance registrations want a valid target-node in case
> + * they are ever onlined as memory (see hmem_register_device()).
> + *
> + * TODO: consult cached numa info
> + */
> +int phys_to_target_node(phys_addr_t addr)
> +{
> +	pr_warn_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
> +			addr);
> +	return 0;
> +}

Could you implement a generic version of this by iterating over the nodes
with for_each_{,online_}node() and checking for intersection with
node_{start,end}_pfn()?

Will

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range
  2020-03-25 11:10   ` Will Deacon
@ 2020-03-25 17:10     ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-25 17:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ACPI, Jonathan Cameron, Brice Goglin, Ard Biesheuvel,
	Rafael J. Wysocki, Jeff Moyer, Catalin Marinas, Peter Zijlstra,
	Dave Hansen, linux-nvdimm, Linux Kernel Mailing List, X86 ML,
	Joao Martins

On Wed, Mar 25, 2020 at 4:10 AM Will Deacon <will@kernel.org> wrote:
>
> On Sun, Mar 22, 2020 at 09:12:58AM -0700, Dan Williams wrote:
> > The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
> > "soft reserved" memory as an "hmem" device")' only registered ranges to
> > the hmem driver for each soft-reservation that also appeared in the
> > HMAT. While this is meant to encourage platform firmware to "do the
> > right thing" and publish an HMAT, the corollary is that platforms that
> > fail to publish an accurate HMAT will strand memory from Linux usage.
> > Additionally, the "efi_fake_mem" kernel command line option enabling
> > will strand memory by default without an HMAT.
> >
> > Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
> > to be published as raw resource ranges for the hmem driver to consume.
> >
> > Include a module parameter to disable either this fallback behavior, or
> > the hmat enabling from creating hmem devices. The module parameter
> > requires the hmem device enabling to have unique name in the module
> > namespace: "device_hmem".
> >
> > Rather than mark this x86-only, include an interim phys_to_target_node()
> > implementation for arm64.
> >
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Brice Goglin <Brice.Goglin@inria.fr>
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  arch/arm64/mm/numa.c      |   13 +++++++++++++
> >  drivers/dax/Kconfig       |    1 +
> >  drivers/dax/hmem/Makefile |    3 ++-
> >  drivers/dax/hmem/device.c |   33 +++++++++++++++++++++++++++++++++
> >  4 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index 4decf1659700..00fba21eaec0 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -468,3 +468,16 @@ int memory_add_physaddr_to_nid(u64 addr)
> >       pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >       return 0;
> >  }
> > +
> > +/*
> > + * device-dax instance registrations want a valid target-node in case
> > + * they are ever onlined as memory (see hmem_register_device()).
> > + *
> > + * TODO: consult cached numa info
> > + */
> > +int phys_to_target_node(phys_addr_t addr)
> > +{
> > +     pr_warn_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
> > +                     addr);
> > +     return 0;
> > +}
>
> Could you implement a generic version of this by iterating over the nodes
> with for_each_{,online_}node() and checking for intersection with
> node_{start,end}_pfn()?

Interesting. The gap is that node_{start,end}_pfn() requires
node_data[] which to date architectures have only setup for online
nodes. Recall a target node is an offline node that could come online
later. However, reworking offline data into node_data could be the
local solution for arm64, I'd just ask that each of the 6
memory-hotplug capable archs go make that opt-in decision themselves.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
  2020-03-24 21:04     ` Dan Williams
@ 2020-03-25 22:32       ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2020-03-25 22:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux ACPI, Rafael J. Wysocki, Peter Zijlstra, Dave Hansen,
	Ard Biesheuvel, linux-nvdimm, Linux Kernel Mailing List, X86 ML

On Tue, Mar 24, 2020 at 2:04 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Mar 24, 2020 at 12:41 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >
> >
> > On 3/22/20 4:12 PM, Dan Williams wrote:
> > > In preparation for exposing "Soft Reserved" memory ranges without an
> > > HMAT, move the hmem device registration to its own compilation unit and
> > > make the implementation generic.
> > >
> > > The generic implementation drops usage acpi_map_pxm_to_online_node()
> > > that was translating ACPI proximity domain values and instead relies on
> > > numa_map_to_online_node() to determine the numa node for the device.
> > >
> > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > > ---
> > >  drivers/acpi/numa/hmat.c  |   68 ++++-----------------------------------------
> > >  drivers/dax/Kconfig       |    4 +++
> > >  drivers/dax/Makefile      |    3 +-
> > >  drivers/dax/hmem.c        |   56 -------------------------------------
> > >  drivers/dax/hmem/Makefile |    5 +++
> > >  drivers/dax/hmem/device.c |   64 ++++++++++++++++++++++++++++++++++++++++++
> > >  drivers/dax/hmem/hmem.c   |   56 +++++++++++++++++++++++++++++++++++++
> > >  include/linux/dax.h       |    8 +++++
> > >  8 files changed, 144 insertions(+), 120 deletions(-)
> > >  delete mode 100644 drivers/dax/hmem.c
> > >  create mode 100644 drivers/dax/hmem/Makefile
> > >  create mode 100644 drivers/dax/hmem/device.c
> > >  create mode 100644 drivers/dax/hmem/hmem.c
> > >
> > > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > > index a12e36a12618..134bcb40b2af 100644
> > > --- a/drivers/acpi/numa/hmat.c
> > > +++ b/drivers/acpi/numa/hmat.c
> > > @@ -24,6 +24,7 @@
> > >  #include <linux/mutex.h>
> > >  #include <linux/node.h>
> > >  #include <linux/sysfs.h>
> > > +#include <linux/dax.h>
> > >
> > >  static u8 hmat_revision;
> > >  static int hmat_disable __initdata;
> > > @@ -640,66 +641,6 @@ static void hmat_register_target_perf(struct memory_target *target)
> > >       node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
> > >  }
> > >
> > > -static void hmat_register_target_device(struct memory_target *target,
> >           ^^^^ int ?
> >
> > > -             struct resource *r)
> > > -{
> > > -     /* define a clean / non-busy resource for the platform device */
> > > -     struct resource res = {
> > > -             .start = r->start,
> > > -             .end = r->end,
> > > -             .flags = IORESOURCE_MEM,
> > > -     };
> > > -     struct platform_device *pdev;
> > > -     struct memregion_info info;
> > > -     int rc, id;
> > > -
> > > -     rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> > > -                     IORES_DESC_SOFT_RESERVED);
> > > -     if (rc != REGION_INTERSECTS)
> > > -             return;
> >                 ^ return -ENXIO;
> >
> > > -
> > > -     id = memregion_alloc(GFP_KERNEL);
> > > -     if (id < 0) {
> > > -             pr_err("memregion allocation failure for %pr\n", &res);
> > > -             return;
> >                 ^ return -ENOMEM;
> >
> > > -     }
> > > -
> > > -     pdev = platform_device_alloc("hmem", id);
> > > -     if (!pdev) {
> >
> >                 rc = -ENOMEM;
> >
> > > -             pr_err("hmem device allocation failure for %pr\n", &res);
> > > -             goto out_pdev;
> > > -     }
> > > -
> > > -     pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
> > > -     info = (struct memregion_info) {
> > > -             .target_node = acpi_map_pxm_to_node(target->memory_pxm),
> > > -     };
> > > -     rc = platform_device_add_data(pdev, &info, sizeof(info));
> > > -     if (rc < 0) {
> > > -             pr_err("hmem memregion_info allocation failure for %pr\n", &res);
> > > -             goto out_pdev;
> > > -     }
> > > -
> > > -     rc = platform_device_add_resources(pdev, &res, 1);
> > > -     if (rc < 0) {
> > > -             pr_err("hmem resource allocation failure for %pr\n", &res);
> > > -             goto out_resource;
> > > -     }
> > > -
> > > -     rc = platform_device_add(pdev);
> > > -     if (rc < 0) {
> > > -             dev_err(&pdev->dev, "device add failed for %pr\n", &res);
> > > -             goto out_resource;
> > > -     }
> > > -
> > > -     return;
> >         ^^^^^^ return 0;
> > > -
> > > -out_resource:
> > > -     put_device(&pdev->dev);
> > > -out_pdev:
> > > -     memregion_free(id);
> >
> >         return rc;
> >
> > > -}
> > > -
> > >  static void hmat_register_target_devices(struct memory_target *target)
> > >  {
> > >       struct resource *res;
> > > @@ -711,8 +652,11 @@ static void hmat_register_target_devices(struct memory_target *target)
> > >       if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
> > >               return;
> > >
> > > -     for (res = target->memregions.child; res; res = res->sibling)
> > > -             hmat_register_target_device(target, res);
> > > +     for (res = target->memregions.child; res; res = res->sibling) {
> > > +             int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
> > > +
> > > +             hmem_register_device(target_nid, res);
> > > +     }
> > >  }
> > >
> >
> > If it makes sense to propagate error to hmem_register_device (as noted above),
> > then here perhaps a pr_err() if hmem registration fails mainly for reporting
> > purposes?
>
> Yeah, I guess it makes sense to at least log that something went wrong
> if someone wonders where their memory device went. I'll add that
> rework as a follow-on.

Now that I look again hmat_register_target_device() already has
multiple pr_err() indicating that something went wrong. The error code
would not go anywhere useful.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-03-25 22:33 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-22 16:12 [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Dan Williams
2020-03-22 16:12 ` [PATCH v2 1/6] x86/numa: Cleanup configuration dependent command-line options Dan Williams
2020-03-22 16:12 ` [PATCH v2 2/6] x86/numa: Add 'nohmat' option Dan Williams
2020-03-22 16:12 ` [PATCH v2 3/6] efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance Dan Williams
2020-03-22 16:12 ` [PATCH v2 4/6] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device Dan Williams
2020-03-24 19:40   ` Joao Martins
2020-03-24 21:04     ` Dan Williams
2020-03-25 22:32       ` Dan Williams
2020-03-22 16:12 ` [PATCH v2 5/6] resource: Report parent to walk_iomem_res_desc() callback Dan Williams
2020-03-22 16:12 ` [PATCH v2 6/6] ACPI: HMAT: Attach a device for each soft-reserved range Dan Williams
2020-03-24 19:41   ` Joao Martins
2020-03-24 21:06     ` Dan Williams
2020-03-24 21:30       ` Joao Martins
2020-03-25 11:10   ` Will Deacon
2020-03-25 17:10     ` Dan Williams
2020-03-25 10:02 ` [PATCH v2 0/6] Manual definition of Soft Reserved memory devices Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).